═══ 1. Disclaimer ═══

The information contained in this file was obtained from a Bulletin Board System owned and operated by IBM. However, this is no guarantee of its accuracy. As a result, I can make no warranty of its usefulness to you. Moreover, there is the possibility that IBM will change these codes over time, although the probability of this is infinitesimal.

Notwithstanding the above, I have found the information useful.

David W. Noon.

═══ 2. An Overview of Traps, by Denis Tonn ═══

There have been a lot of queries and discussion about "Traps" under OS/2. Rather than try to explain each time the topic comes up, I have put together the following, which hopefully will help users understand the basics. The following is essentially accurate, although incomplete in minor details and (hopefully!) substantially simplified.

═══ 2.1. What is a "Trap"? ═══

There is a class of CPU detected conditions on Intel processors called "Machine Exceptions". These exceptions are grouped into different "types", and one of these types is called TRAP. Other types are INT, FAULT, ABORT, and DR6. Because the majority of the types of Machine Exceptions are of type TRAP, common usage is to call all exceptions "Traps".

═══ 2.2. What causes a "Trap"? ═══

Rather than try to detail all possible causes, I will focus on a few of the more common ones encountered. For full details on all possible causes, refer to the Intel Programmer's Reference.

A little background is required before getting into possible causes and what is "important" about the types of information displayed. The current Intel CPUs have something called "protected mode" that they can run in. This mode sets up some "tables" that the CPU refers to for access to memory addresses. These tables include information about any application's ability to "access" a particular address. Addresses can be "protected" from particular kinds of access.
Some addresses can be marked as read only (cannot be written to), and some are reserved for higher privileged code (Priv level 0 or Ring 0 as it is usually called). An address can even be marked as totally invalid (no access allowed at all). When the CPU encounters any problem with access to an address, it will generate an exception and transfer control to a "vector", which is usually the address of OS supplied code that then attempts to "handle" the exception in some fashion.

═══ 2.3. Trap D ═══

In its essence, a Trap D is a violation of any of the protection implemented in the Descriptor Table(s). The CPU will detect this violation and generate a Machine Exception D (this is type FAULT, so "Trap D" is a misnomer). The operating system will gain control (the CPU does this automatically) and can take actions at this point in time. I will discuss the possible actions a little further down, focusing specifically on OS/2.

═══ 2.4. Trap C ═══

This is a "special case" of an Exception D (also of type FAULT) relating to the use of CPU registers called the Stack Selector and Stack Pointer. Suffice it to say that the code has run out of temporary storage space addressed by these 2 registers. Again, the CPU will transfer control to the OS when it encounters this condition.

═══ 2.5. Trap E ═══

Similar to Trap D (it's also of type FAULT), but relating to the protection mechanisms implemented in another set of tables called "Page Tables". This is a second level of "virtual addressing and protection" implemented in 386 (and up) Intel CPUs. Again, the OS is given control when the CPU encounters a problem using an address through the page tables. Interestingly, this is the mechanism that OS/2 2.0 (and higher) uses to implement "memory overcommit", which allows you to run programs that would use more RAM than you actually have installed.
In essence, OS/2 uses swapper.dat as a "temporary" place to store the "memory" that the programs are accessing and moves it into (and out of) RAM as required. When the "memory at this address" is stored in swapper.dat, the page table entry (that the CPU uses) will be marked as "not resident" (essentially invalid as far as the CPU is concerned) and OS/2 will swap it in as required. More on this later.

═══ 2.6. Trap 8 ═══

If the CPU cannot "transfer control" to an OS supplied vector to "handle" one of the above exceptions, then it generates an Exception 8 (ABORT). This is a pretty catastrophic condition as far as the system is concerned, and the OS supplied handler for Trap 8 will halt the system completely.

═══ 2.7. Trap 3 ═══

This is a special Trap (and yes, it is of type "Trap") that is used by programmers to debug their code. Sometimes these are "left in" accidentally or purposely, as a "die now" case when the code cannot logically do anything else.

═══ 2.8. OK, how does OS/2 handle CPU generated exceptions? ═══

Keep in mind that OS/2 will gain control after the CPU detects these conditions. The first thing OS/2 will do is take a look at the "cause" of the CPU detected "trap". If it can do anything about correcting the condition, it will do so.

If this is a Trap E caused by the fact that the memory at this "address" is out on the disk, then it will find (or make) a spare RAM page to bring the "memory" into, update the page tables for this address to point to this RAM location, and cause the CPU to restart the instruction that "failed". In this case, neither the application nor the user even knows that a "trap" occurred. Most Trap E's are handled in this fashion.

If it cannot "fix" the "failing address" in the above fashion, because it does not "know" what should be at this address or because the app has never "allocated" the address, it then passes control to any registered "exception handlers", which then take a crack at "fixing" the address somehow.
If any of these manage to fix it up, then OS/2 restarts the instruction just as in the previous case. Again, the user usually does not know that a "trap" occurred, although possibly an application level exception handler might have taken note of it. It is quite possible to have one of these exception handlers subsequently cause a "trap", and then we start at the top of this flow with the "new" trap information.

If neither the OS swap handler nor an exception handler can do anything to "fix up" the memory address, then OS/2 has no choice but to "kill" the process involved. If it is an application that contains the failing instruction, then OS/2 will pop up an "exception screen" telling the user what happened and offering to give more information (registers) before "ending" the application. This is an "application trap" and is usually accompanied by a SYS317x message.

The same situation(s) could occur in privilege level 0 code (kernel, device driver, file system driver, etc.), and in that case there is no "process" that can be "ended", so the whole system will "halt", giving the user information (registers) on the screen as a last gasp before it "halts". This is called an Internal Processing Error (IPE) or, as some choose to call it, "the black screen of death with meaningless numbers displayed". The system may be "dead", but the numbers and letters can be quite meaningful to those with the right knowledge and tools.

Note: There is a "halt" routine in the kernel, and it can be called from various places, including the routine that formats and displays the register information from traps. This halt routine will "report" the location that called it. In the case of "traps" the halt routine will always report the address of the "register display" routine. Useful, but the "address" displayed is redundant. The halt routine also displays the level of the system and CPU type.
If there is formatted register information then that is usually the "cause" of the problem and should not be ignored when reporting such.

No matter if the "trap failure" is in application or privileged code, it is important to gather all the information available. It is the starting point for the programmer to locate the problem. It may not be enough to fully "find" the problem, but you have to start somewhere. General users should not be making decisions on what is "important" from these displays, since the process of problem analysis may use none, any, or all of this information.

═══ 2.9. Hardware or Software, and what do I do? ═══

All the previous discussion seems to be oriented to a "software" explanation of the causes of "traps". This is because a "trap" is detected by the CPU during its execution of CPU instructions. This presupposes that the tables, instructions, data/code pointers, etc. are operating correctly in the hardware, which they generally are. But if the hardware has a flaw that causes any of these values to be "incorrect" then the symptoms will be exactly the same (as far as the CPU is concerned) as if it were a "software" failure. Sometimes an external piece of hardware can cause a condition (status) that the software "driver" never expected (because it will never occur on correctly working hardware), which in turn causes the "driver" to "trap" (possibility of trap 3).

There are all kinds of possibilities and even "combinations" of software and hardware that can be related to a "trap". There is no hard and fast rule to know which is the problem. But there is some methodology that can be applied to make "guesses" about the causes, which can then be "tested" with some "actions".

• If you have 2 systems with the same hardware and same software setup, and it fails on one but not the other, then you are probably looking at a hardware failure.

• If you have multiple systems of different hardware and setups displaying the same types of failures, it is probably software. (The difficulty arises in defining what "same types of failures" means. This presupposes a solid understanding of "failure types" beyond the scope of this simple explanation. It is a lot more than a simple "trap D" type of description.)

• If the symptom seems to "make sense" from a (larger) software point of view, then pursue it as such. If the symptoms do not make "software sense", or if you have a range of symptoms that do not seem to follow a discernible pattern, then suspect hardware.

• Do NOT assume that hardware is working correctly because it runs diagnostics or worked with another OS/driver/version/etc. Any changes to a system can expose potential hardware problems, and diagnostics rarely find intermittent problems.

• Always keep in mind that you may be going down the wrong path in your diagnosis and "double check" yourself fairly frequently.

• Gather as much information as you can about the problem, including full "trap register" data. Document step by step descriptions of the way you "create" the problem if at all possible. Document as much about your hardware as you can. Document as much about your system setup as you can. Be prepared to offer any (or all) of this information to anyone who is assisting you. You don't have to offer it all at once, but be prepared to do so.

• Test any "guesses" with actions. If you suspect any hardware, swap or replace it. Try slowing down memory access timing, or turning off CPU cache (BIOS setup). DMA and Bus timing can also be involved. If you suspect software, then replace it with the latest versions and see if it still fails. Gather the same data (as above) and report your problem. If others with the same (or similar) hardware and software setups cannot duplicate your symptoms, then either you need to supply more data to them or you possibly have a hardware problem.

• If the problem is a trap that generates an IPE (and a halted system) then it may be possible for an analyst to get a pretty good idea of the problem purely from the full data on the "trap screen". All of the data may be significant, so don't "skip anything". If the full screen of data is available, then an analyst can usually decide if the trap was detected inside OS/2 code or in a device/FS driver.

• In many cases, the trap screen alone may not be enough and further data gathering is required. System traces, a system Dump, or installation of a special "debug kernel" for "live debugging" of the problem are possibilities.

There is much more that could be discussed or recommended for any particular problem, but the above should suffice to get most people started in a logical approach to finding the cause of a trap. Remember that anyone trying to assist you cannot just "pull a rabbit out of the hat"; they have to have some solid data to work with for a starting point. Frequently others have encountered the same problem and have already found the solution. Other people can try the same "tests" that you are doing, which will give you an idea if the problem is widespread (software) or unique to you (probably hardware). The more complete you make your description, the more "accurate" the advice you will receive.

Oh, one last part to all this. An application (and the system for that matter) consists of multiple modules all working together. You may have a case of one module trying to "use" information supplied by a different module and thereby generating a "trap". The "failing module" is not the one that used the bad information (from the popup screen), but the one that generated it. Deep analysis of this kind of problem cannot be done from trap screen data alone and usually requires further data gathering (as described above).

═══ 3. Trap Errors ═══

The next sections classify and explain the Trap codes reported by OS/2.

═══ 3.1. Summary of Trap Errors in OS/2 ═══

The following is a summary of each trap code, listed in numerical order.

┌───────┬────┬───────────────────────────────────┐
│Decimal│Hex │Description                        │
├───────┼────┼───────────────────────────────────┤
│00     │0000│Divide by zero error               │
├───────┼────┼───────────────────────────────────┤
│01     │0001│Debug Exception                    │
├───────┼────┼───────────────────────────────────┤
│02     │0002│Non-Maskable Interrupt (NMI)       │
├───────┼────┼───────────────────────────────────┤
│03     │0003│Debug Breakpoint                   │
├───────┼────┼───────────────────────────────────┤
│04     │0004│Overflow Detected                  │
├───────┼────┼───────────────────────────────────┤
│05     │0005│Bound Range Exceeded               │
├───────┼────┼───────────────────────────────────┤
│06     │0006│Invalid Opcode Instruction         │
├───────┼────┼───────────────────────────────────┤
│07     │0007│Coprocessor not Available          │
├───────┼────┼───────────────────────────────────┤
│08     │0008│Double Fault                       │
├───────┼────┼───────────────────────────────────┤
│09     │0009│Coprocessor Segment Overrun        │
├───────┼────┼───────────────────────────────────┤
│10     │000A│Invalid Task State Segment         │
├───────┼────┼───────────────────────────────────┤
│11     │000B│Segment not Available              │
├───────┼────┼───────────────────────────────────┤
│12     │000C│Stack Fault                        │
├───────┼────┼───────────────────────────────────┤
│13     │000D│General Protection Fault           │
├───────┼────┼───────────────────────────────────┤
│14     │000E│Page Fault                         │
├───────┼────┼───────────────────────────────────┤
│15     │000F│Reserved by Intel                  │
├───────┼────┼───────────────────────────────────┤
│16     │0010│Coprocessor Error                  │
└───────┴────┴───────────────────────────────────┘

═══ 3.2. Explanations of Trap Codes ═══

A trap 0000 occurs when a program attempts to divide a number by zero, or when the result of the division is too large for the destination register to hold. [SYS1930]

A trap 0001 is caused when a program enables the single step interrupt when not being run by a debugger.
[SYS1931]

A trap 0002 is caused when a Non-Maskable Interrupt (NMI) is generated by the system for a catastrophic error. Four possible causes of this are:

Error  Cause
110    Planar parity error: memory or system board
111    I/O parity error: memory adapter or memory
112    Watchdog time-out: any adapter, system board
113    DMA arbitration time-out: any adapter, system board

A trap 0003 is caused when the program executes an INT3 without being run under a debugger. This happens because debugging code was left in the program either accidentally or by design. [SYS1933]

A trap 0004 is caused when a program executes an INTO instruction without registering an overflow exception handler. [SYS1934]

A trap 0005 is caused when a program executes a BOUND instruction without registering a bound exception handler. [SYS1935]

A trap 0006 is caused when a program executes an invalid instruction without registering an invalid opcode exception handler. [SYS1936]

A trap 0007 is caused when a program executes a numeric coprocessor instruction without a coprocessor in the system and without registering a processor extension not available exception handler. [SYS1937]

A trap 0008 is caused when the processor detects an exception while processing another exception. [SYS1938]

A trap 0009 is caused when a program runs a numeric coprocessor instruction that tries to read or write past the end of the storage segment. [SYS1939]

A trap 000A is caused when a program attempts a task switch to an invalid task state segment. [SYS1940]

A trap 000B is caused when a program attempts to reference a memory segment that isn't present. [SYS1941]

A trap 000C is caused when a program attempts to push more data onto the stack than it can hold, call too many subroutines, take more data off the stack than was pushed onto it, or return from more subroutines than were called.
[SYS1942]

A trap 000D is caused by (but not limited to) a program referencing storage outside the limit of the memory segment, referencing a storage segment that is restricted to privileged code, referencing storage with a selector value of zero, writing to read-only memory or a code segment, reading from an execute-only code segment, or loading an invalid value into a selector register. [SYS1943]

A trap 000E is caused when a page being referenced is not present in memory, the procedure referencing the page doesn't have enough privilege to access the page, or the address range was allocated but no storage is committed.

A trap 000F is reserved by Intel. It's not for our use.

A trap 0010 is caused when the processor detects an error from the coprocessor, either by hardware or software.

═══ 3.3. Internal Processing Errors ═══

Although not exactly a trap error, an Internal Processing Error (IPE) is associated with a SYS1915 error, which can be caused by the same conditions as a trap error and is subsequently solved in the same manner. The key to determining the type of error is in interpreting the error message. A message "The system has detected an internal processing error at location ##0160:FFF6FC01-000D:00015C01....." would tell you that you should follow the troubleshooting procedures for a trap 000D error. Locate the trap code in the error message by finding the three zeros followed by a letter or number (...01-000D:00...). In the IPE message, it will generally fall after the first cluster of digits.

OS/2 FixPak 14 added to the information available for a Trap 0002 by giving additional information in the ErrCode field. This is explained in the following.

PROBLEM CONCLUSION: This problem has been corrected in FixPak 14 for Warp. The ERRCD field will now contain the values read from hardware port 0x61 and, on an EISA bus, also from port 0x461.
The following codes represent the indicated actions needed:

0000  Software caused NMI
0001  RAM error, check memory (parity error)
0002  Adapter caused error (I/O channel check)
0003  Check bus mastering adapters, update adapter drivers. Also disable bus mastering on failing adapters as a problem determination tool to figure out which adapter is causing the failure. Contact the adapter vendor for further assistance. (DMA timeout)
0004  A device driver or DOS application disabled interrupts too long. Contact appropriate software vendor for updated software (Watchdog timeout).
0005  Contact application or device driver hardware vendor (software generated NMI).
0006  See error code 0003.
0007  See error code 0004.
0008  See error code 0002.
0009  See error code 0001.

═══ 3.3.1. Related Error Messages ═══

The following built in OS/2 error messages can be queried to see additional help for the error codes:

For this ERRCD:   Type HELP to get more help
0000              1944
0001              1945
0002              1946
0003              1947
0004              1948
0005              3140
0006              3141
0007              3142
0008              3143
0009              3144

(Example: Typing HELP 1945 from an OS/2 command prompt will provide additional useful information about the failure that occurred.)
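
As a rough illustration of the two lookups described in section 3.3, the following Python sketch pulls the trap code out of an IPE message using the "three zeros followed by a letter or number" rule, and maps a Trap 0002 ERRCD value to the matching HELP command from the list above. The names used here (NMI_ERRCD_HELP, trap_code_from_ipe, help_command_for_errcd) are invented for this example and are not part of OS/2. The pattern also assumes the code is preceded by "-" and followed by ":", as in the sample message; that may not hold for every IPE screen, and by the rule as stated it would not pick up a trap 0010.

```python
import re

# ERRCD value -> HELP message number, copied from the list in section 3.3.1.
# (Hypothetical table name; the values come from the text above.)
NMI_ERRCD_HELP = {
    0x0000: 1944,
    0x0001: 1945,
    0x0002: 1946,
    0x0003: 1947,
    0x0004: 1948,
    0x0005: 3140,
    0x0006: 3141,
    0x0007: 3142,
    0x0008: 3143,
    0x0009: 3144,
}

# Assumption: the trap code appears as "-000X:" (three zeros plus one hex
# digit), matching the "...01-000D:00..." fragment quoted in the text.
_TRAP_IN_IPE = re.compile(r"-(000[0-9A-Fa-f]):")


def trap_code_from_ipe(message):
    """Return the four-character trap code found in an IPE message, or None."""
    match = _TRAP_IN_IPE.search(message)
    return match.group(1).upper() if match else None


def help_command_for_errcd(errcd):
    """Return the HELP command to type for a Trap 0002 ERRCD, or None if unlisted."""
    number = NMI_ERRCD_HELP.get(errcd)
    return None if number is None else f"HELP {number}"


if __name__ == "__main__":
    sample = ("The system has detected an internal processing error "
              "at location ##0160:FFF6FC01-000D:00015C01")
    print(trap_code_from_ipe(sample))      # trap code to look up in section 3.2
    print(help_command_for_errcd(0x0001))  # command to type at an OS/2 prompt
```

For the sample message quoted in section 3.3, trap_code_from_ipe returns "000D", and help_command_for_errcd(0x0001) returns "HELP 1945", which is the example command given above. ERRCD values outside the listed range return None rather than a guess.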