═══ 1. Disclaimer ═══

The information contained in this file was obtained from a Bulletin Board 
System owned and operated by IBM. However, this is no guarantee of its 
accuracy. As a result, I can make no warranty of its usefulness to you. 

Moreover, there is the possibility that IBM will change these codes over time, 
although the probability of this is infinitesimal. 

Notwithstanding the above, I have found the information useful. 

David W. Noon. 


═══ 2. An Overview of Traps, by Denis Tonn ═══

There have been a lot of queries and discussion about "Traps" under OS/2. Rather 
than try to explain each time the topic comes up, I have put together the 
following, which I hope will help users understand the basics. The following 
is essentially accurate, although incomplete in minor details and (hopefully!) 
substantially simplified. 


═══ 2.1. What is a "Trap"? ═══

There is a class of CPU detected conditions on Intel processors called "Machine 
Exceptions". These exceptions are grouped into different "types", and one of 
these types is called TRAP. Other types are INT, FAULT, ABORT, and DR6. Because 
the majority of the types of Machine Exceptions are of type TRAP, common usage 
is to call all exceptions "Traps". 


═══ 2.2. What causes a "Trap"? ═══

Rather than try and detail all possible causes, I will try and focus on a few 
of the more common ones encountered. For full details on all possible causes, 
refer to the Intel Programmer's Reference. 

A little background is required before getting into possible causes and what is 
"important" about the types of information displayed. 

The current Intel CPUs have something called "protected mode" that they can 
run in. This mode sets up some "tables" that the CPU refers to for access to 
memory addresses. These tables include information about any application's 
ability to "access" a particular address. Addresses can be "protected" from 
particular kinds of access. Some addresses can be marked as read only (cannot 
be written to), some addresses are reserved for higher privileged code (Priv 
level 0 or Ring 0 as it is usually called). The address can even be marked as 
totally invalid (no access allowed at all). 

When the CPU encounters any problems with access to an address it will generate 
an exception and transfer control to a "vector" which is usually the address of 
OS supplied code which then attempts to "handle" the exception in some fashion. 


═══ 2.3. Trap D ═══

In its essence, a Trap D is a violation of any of the protection implemented 
in the Descriptor Table(s). The CPU will detect this violation and generate a 
Machine Exception D (this is type FAULT, so "Trap D" is a misnomer). The 
operating system will gain control (the CPU does this automatically) and can 
take actions at this point in time. I will discuss the possible actions a 
little further down, focusing specifically on OS/2. 
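
As an illustration only, the kinds of protection described above can be 
sketched as a toy model. Everything here (the Descriptor fields, the 
check_access function and the GeneralProtectionFault class) is a hypothetical 
name invented for this sketch; real descriptor tables hold more fields and the 
real checks are performed by the CPU itself, not by software. 

```python
from dataclasses import dataclass


class GeneralProtectionFault(Exception):
    """Stands in for Machine Exception D in this toy model."""


@dataclass(frozen=True)
class Descriptor:
    valid: bool         # False models an address marked as totally invalid
    writable: bool      # False models memory marked as read only
    required_ring: int  # 0 models memory reserved for Ring 0 code


def check_access(desc, ring, write):
    """Raise GeneralProtectionFault if the access breaks a protection rule.

    A lower ring number means higher privilege, so the caller's ring must
    be numerically no greater than the ring the descriptor requires.
    """
    if not desc.valid:
        raise GeneralProtectionFault("no access allowed at this address")
    if ring > desc.required_ring:
        raise GeneralProtectionFault("address reserved for higher privileged code")
    if write and not desc.writable:
        raise GeneralProtectionFault("write to read-only address")
```

In this sketch an application (Ring 3) writing to a read-only address, or 
touching an address reserved for Ring 0, raises the exception; in the real 
system that is the point where the CPU hands control to the operating system. 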


═══ 2.4. Trap C ═══

This is a "special case" of an Exception D (also of type FAULT) relating to the 
use of CPU registers called the Stack Selector and Stack Pointer. Suffice it to 
say that the code has run out of temporary storage space addressed by these 2 
registers. Again, the CPU will transfer control to the OS when it encounters 
this condition. 


═══ 2.5. Trap E ═══

Similar to Trap D (it's also of type FAULT), but relating to the protection 
mechanisms implemented in another set of tables called "Page Tables". This is a 
second level of "virtual addressing and protection" implemented in 386 (and up) 
Intel CPUs. Again, the OS is given control when the CPU encounters a problem 
using an address through the page tables. 

Interestingly, this is the mechanism that OS/2 2.0 (and higher) uses to 
implement "memory overcommit" which allows you to run programs which would use 
more RAM than you actually have installed. In essence, OS/2 uses swapper.dat as 
a "temporary" place to store the "memory" that the programs are accessing 
and moves it into (and out of) RAM as required. When the "memory at this 
address" is stored in swapper.dat, the page table entry (that the CPU uses) 
will be marked as "not resident" (essentially invalid as far as the CPU is 
concerned) and OS/2 will swap it in as required. More on this later. 
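
The overcommit idea can be sketched with a toy pager. This is a simplified 
model under stated assumptions, not OS/2's actual algorithm: the ToyPager 
class, its method names and the oldest-page-first eviction choice are all 
invented for illustration, and a Python dictionary stands in for swapper.dat. 

```python
class PageFault(Exception):
    """Stands in for Exception E on a page marked "not resident"."""

    def __init__(self, page):
        super().__init__(f"page {page} is not resident")
        self.page = page


class ToyPager:
    def __init__(self, ram_frames):
        self.ram_frames = ram_frames  # how many pages fit in "RAM"
        self.resident = {}            # page -> contents, oldest page first
        self.swapped = {}             # page -> contents (the "swapper.dat")
        self.faults = 0

    def _access(self, page):
        if page not in self.resident:
            raise PageFault(page)
        return self.resident[page]

    def _handle_fault(self, page):
        self.faults += 1
        if len(self.resident) >= self.ram_frames:
            # Assumption: evict the page that became resident first.
            victim = next(iter(self.resident))
            self.swapped[victim] = self.resident.pop(victim)
        # Bring the page in from swap; a never-used page starts out as 0.
        self.resident[page] = self.swapped.pop(page, 0)

    def read(self, page):
        try:
            return self._access(page)
        except PageFault as fault:
            self._handle_fault(fault.page)
            return self._access(page)  # restart the "failed" access

    def write(self, page, value):
        self.read(page)  # make sure the page is resident first
        self.resident[page] = value
```

With ToyPager(2), writing to three different pages succeeds even though only 
two fit in "RAM" at once; the caller never sees the faults, only the counter 
records them. 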


═══ 2.6. Trap 8 ═══

If the CPU cannot "transfer control" to an OS supplied vector to "handle" one 
of the above exceptions, then it generates an Exception 8 (ABORT). This is a 
pretty catastrophic condition as far as the system is concerned, and the OS 
supplied handler for Trap 8 will halt the system completely. 


═══ 2.7. Trap 3 ═══

This is a special Trap (and yes, it is of type "Trap") that is used by 
programmers to debug their code. Sometimes these are "left in" accidentally or 
purposely, as a "die now" case when the code cannot logically do anything else. 


═══ 2.8. OK, how does OS/2 handle CPU generated exceptions? ═══

Keep in mind that OS/2 will gain control after the CPU detects these 
conditions. The first thing OS/2 will do is take a look at the "cause" of the 
CPU detected "trap". If it can do anything about correcting the condition, it 
will do so. 

If this is a Trap E caused by the fact that the memory at this "address" is out 
on the disk, then it will find (or make) a spare RAM page to bring the "memory" 
into, update the page tables for this address to point to this RAM location, 
and cause the CPU to restart the instruction that "failed". In this case, 
neither the application nor the user even knows that a "trap" occurred. Most 
Trap E's are handled in this fashion. 

If it cannot "fix" the "failing address" in the above fashion because it does 
not "know" what should be at this address or if the app has never "allocated" 
the address, it then passes control to any registered "exception handlers" that 
then take a crack at "fixing" the address somehow. If any of these manage to 
fix it up, then OS/2 restarts the instruction just as in the previous case. 
Again, the user usually does not know that a "trap" occurred, although possibly 
an application level exception handler might have taken note of it. It is quite 
possible to have one of these exception handlers subsequently cause a "trap", 
and then we start at the top of this flow with the "new" trap information. 

If neither the OS swap handler nor an exception handler can do anything to "fix 
up" the memory address, then OS/2 has no choice but to "kill" the process 
involved. If it is an application that contains the failing instruction, then 
OS/2 will pop up an "exception screen" telling the user what happened and 
offering to give more information (registers) before "ending" the application. 
This is an "application trap" and is usually accompanied by a SYS317x message. 

The same situation(s) could occur in privilege level 0 code (kernel, device 
driver, file system driver, etc) and in that case there is no "process" that 
can be "ended" so the whole system will "halt" giving the user information 
(registers) on the screen as a last gasp before it "halts". This is called an 
Internal Processing Error (IPE) or as some choose to call it "the black screen of 
death with meaningless numbers displayed". The system may be "dead", but the 
numbers and letters can be quite meaningful to those with the right knowledge 
and tools. 
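
The flow described in this section can be summarized as a short sketch. It is 
a deliberately simplified model: the Outcome names, the handle_exception 
function and its three parameters are hypothetical, and the real kernel works 
from far more state than three inputs. 

```python
from enum import Enum


class Outcome(Enum):
    RESTART_INSTRUCTION = "restart the failing instruction"
    END_APPLICATION = "end the application (application trap)"
    HALT_SYSTEM = "halt the system (IPE)"


def handle_exception(page_is_swapped_out, handlers, in_ring0):
    """Return what this toy "OS" does after the CPU reports an exception.

    page_is_swapped_out: True if the memory at the address is out on disk.
    handlers: registered exception handlers; each is a callable that
              returns True if it managed to fix up the address.
    in_ring0: True if the failing instruction is in privilege level 0 code.
    """
    # Step 1: the swap handler can correct the condition itself.
    if page_is_swapped_out:
        return Outcome.RESTART_INSTRUCTION
    # Step 2: give each registered exception handler a crack at it.
    for handler in handlers:
        if handler():
            return Outcome.RESTART_INSTRUCTION
    # Step 3: nothing could fix it up.
    return Outcome.HALT_SYSTEM if in_ring0 else Outcome.END_APPLICATION
```

For example, handle_exception(False, [], False) models an application trap, 
while the same call with in_ring0 set to True models an IPE. 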

Note:  There is a "halt" routine in the kernel, and it can be called from 
various places, including the routine that formats and displays the register 
information from traps. This halt routine will "report" the location that 
called it. In the case of "traps" the halt routine will always report the 
address of the "register display" routine. Useful, but the "address" displayed 
is redundant. The halt routine also displays the level of the system and CPU 
type. If there is formatted register information then that is usually the 
"cause" of the problem and should not be ignored when reporting such. 

No matter if the "trap failure" is in application or privileged code, it is 
important to gather all the information available. It is the starting point for 
the programmer to locate the problem. It may not be enough to fully "find" the 
problem, but you have to start somewhere. General users should not be making 
decisions on what is "important" from these displays, since the process of 
problem analysis may use none, any, or all of this information. 


═══ 2.9. Hardware or Software, and what do I do? ═══

All the previous discussion seems to be oriented to a "software" explanation of 
the causes of "traps". This is because a "trap" is detected by the CPU during 
its execution of CPU instructions. It presupposes that the tables, 
instructions, data/code pointers, etc. are operating correctly in the hardware, 
which they generally are. But if the hardware has a flaw that causes any of 
these values to be "incorrect" then the symptoms will be exactly the same (as 
far as the CPU is concerned) as if it was a "software" failure. Sometimes an 
external piece of hardware can cause a condition (status) that the software 
"driver" never expected (because it will never occur on correctly working 
hardware) which in turn causes the "driver" to "trap" (possibility of trap 3). 

There are all kinds of possibilities and even "combinations" of software and 
hardware that can be related to a "trap". There is no hard and fast rule to 
know which is the problem. But there is some methodology that can be applied to 
make "guesses" about the causes, which can then be "tested" with some 
"actions". 

     If you have 2 systems with the same hardware and same software setup, and 
      it fails on one but not the other then you are probably looking at a 
      hardware failure. 

     If you have multiple systems of different hardware and setups displaying 
      the same types of failures, it is probably software. (The difficulty 
      arises in defining what "same types of failures" means. This presupposes 
      a solid understanding of "failure types" beyond the scope of this simple 
      explanation. It is a lot more than a simple "trap D" type of 
      description.) 

     If the symptom seems to "make sense" from a (larger) software point of 
      view, then pursue it as such. If the symptoms do not make "software 
      sense", or if you have a range of symptoms that do not seem to follow a 
      discernible pattern then suspect hardware. 

     Do NOT assume that hardware is working correctly because it runs 
      diagnostics or worked with another OS/driver/version/etc. Any changes to 
      a system can expose potential hardware problems and diagnostics rarely 
      find intermittent problems. 

     Always keep in mind that you may be going down the wrong path in your 
      diagnosis and "double check" yourself fairly frequently. 

     Gather as much information as you can about the problem, including full 
      "trap register" data. Document step by step descriptions of the way you 
      "create" the problem if at all possible. Document as much about your 
      hardware as you can. Document as much about your system setup as you can. 
      Be prepared to offer any (or all) this information to anyone that is 
      assisting you. You don't have to offer it all at once, but be prepared to 
      do so. 

     Test any "guesses" with actions. If you suspect any hardware, swap or 
      replace it. Try slowing down memory access timing, or turning off CPU 
      cache (BIOS setup). DMA and Bus timing can also be involved. If you 
      suspect software, then replace it with the latest versions and see if it 
      still fails. Gather the same data (as above) and report your problem. If 
      others with the same (or similar) hardware and software setups cannot 
      duplicate your symptoms then either you need to supply more data to them 
      or you possibly have a hardware problem. 

     If the problem is a trap that generates an IPE (and a halted system) then 
      it may be possible for an analyst to get a pretty good idea of the 
      problem purely from the full data on the "trap screen". All of the data 
      may be significant, so don't "skip anything". If the full screen of data 
      is available, then an analyst can usually decide if the trap was detected 
      inside OS/2 code or in a device/FS driver. 

     In many cases, the trap screen alone may not be enough and further data 
      gathering is required. System traces, a system Dump, or installation of a 
      special "debug kernel" for "live debugging" of the problem are 
      possibilities. 

 There is much more that could be discussed or recommended for any particular 
 problem, but the above should suffice to get most people started in a logical 
 approach to finding the cause of a trap. Remember that anyone trying to assist 
 you cannot just "pull a rabbit out of the hat", they have to have some solid 
 data to work with for a starting point. Frequently others have encountered the 
 same problem and have already found the solution. Other people can try the 
 same "tests" that you are doing which will give you an idea if the problem is 
 widespread (software) or unique to you (probably hardware). The more complete 
 you make your description, the more "accurate" the advice you will receive. 

 Oh, one last part to all this. An application (and the system for that matter) 
 consists of multiple modules all working together. You may have a case of one 
 module trying to "use" information supplied by a different module and thereby 
 generating a "trap". The "failing module" is not the one that used the bad 
 information (from the popup screen), but the one that generated it. Deep 
 analysis of this kind of problem cannot be done from trap screen data alone 
 and usually requires further data gathering (as described above). 


═══ 3. Trap Errors ═══

The next sections classify and explain the Trap codes reported by OS/2. 


═══ 3.1. Summary of Trap Errors in OS/2 ═══

The following is a summary of each trap code, listed in numerical order. 

                    ┌───────┬────┬───────────────────────────────────┐
                    │Decimal│Hex │Description                        │
                    ├───────┼────┼───────────────────────────────────┤
                    │00     │0000│Divide by zero error               │
                    ├───────┼────┼───────────────────────────────────┤
                    │01     │0001│Debug Exception                    │
                    ├───────┼────┼───────────────────────────────────┤
                    │02     │0002│Non-Maskable Interrupt (NMI)       │
                    ├───────┼────┼───────────────────────────────────┤
                    │03     │0003│Debug Breakpoint                   │
                    ├───────┼────┼───────────────────────────────────┤
                    │04     │0004│Overflow Detected                  │
                    ├───────┼────┼───────────────────────────────────┤
                    │05     │0005│Bound Range Exceeded               │
                    ├───────┼────┼───────────────────────────────────┤
                    │06     │0006│Invalid Opcode Instruction         │
                    ├───────┼────┼───────────────────────────────────┤
                    │07     │0007│Coprocessor not Available          │
                    ├───────┼────┼───────────────────────────────────┤
                    │08     │0008│Double Fault                       │
                    ├───────┼────┼───────────────────────────────────┤
                    │09     │0009│Coprocessor Segment Overrun        │
                    ├───────┼────┼───────────────────────────────────┤
                    │10     │000A│Invalid Task State Segment         │
                    ├───────┼────┼───────────────────────────────────┤
                    │11     │000B│Segment not Available              │
                    ├───────┼────┼───────────────────────────────────┤
                    │12     │000C│Stack Fault                        │
                    ├───────┼────┼───────────────────────────────────┤
                    │13     │000D│General Protection Fault           │
                    ├───────┼────┼───────────────────────────────────┤
                    │14     │000E│Page Fault                         │
                    ├───────┼────┼───────────────────────────────────┤
                    │15     │000F│Reserved by Intel                  │
                    ├───────┼────┼───────────────────────────────────┤
                    │16     │0010│Coprocessor Error                  │
                    └───────┴────┴───────────────────────────────────┘
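
For anyone who wants to look codes up in a script, the table above can be 
transcribed directly. The descriptions are copied from the table; the names 
TRAP_DESCRIPTIONS and describe_trap are made up for this sketch. 

```python
# Trap number -> description, transcribed from the summary table.
TRAP_DESCRIPTIONS = {
    0x00: "Divide by zero error",
    0x01: "Debug Exception",
    0x02: "Non-Maskable Interrupt (NMI)",
    0x03: "Debug Breakpoint",
    0x04: "Overflow Detected",
    0x05: "Bound Range Exceeded",
    0x06: "Invalid Opcode Instruction",
    0x07: "Coprocessor not Available",
    0x08: "Double Fault",
    0x09: "Coprocessor Segment Overrun",
    0x0A: "Invalid Task State Segment",
    0x0B: "Segment not Available",
    0x0C: "Stack Fault",
    0x0D: "General Protection Fault",
    0x0E: "Page Fault",
    0x0F: "Reserved by Intel",
    0x10: "Coprocessor Error",
}


def describe_trap(code):
    """Describe a trap given as an int or as a hex string such as '000D'."""
    number = int(code, 16) if isinstance(code, str) else code
    description = TRAP_DESCRIPTIONS.get(number)
    if description is None:
        return f"Trap {number:04X}: not listed in the summary table"
    return f"Trap {number:04X} (decimal {number:02d}): {description}"
```

Strings are always read as hexadecimal here, so "10" means trap 0010 
(Coprocessor Error), not decimal 10. 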


═══ 3.2. Explanations of Trap Codes ═══

A trap 0000 occurs when a program attempts to divide a number by zero or the 
result of the division is too large for the destination register to hold. 
[SYS1930] 

A trap 0001 is caused when a program enables the single step interrupt when not 
being run by a debugger. [SYS1931] 

A trap 0002 is caused when a Non-Maskable Interrupt (NMI) is generated by the 
system for a catastrophic error. Four possible causes of this are: 

     Error  Cause 
     110    Planar parity error: memory or system board 
     111    I/O parity error: memory adapter or memory 
     112    Watchdog time-out: any adapter, system board 
     113    DMA arbitration time-out: any adapter, system board 

 A trap 0003 is caused when the program executed an INT3 without being run by 
 a debugger. This happens because debugging code was left in the program either 
 accidentally or by design. [SYS1933] 

 A trap 0004 is caused when a program started an INTO instruction without 
 registering an overflow exception handler. [SYS1934] 

 A trap 0005 is caused when a program started a BOUND instruction without 
 registering a bound exception handler. [SYS1935] 

 A trap 0006 is caused when a program started an invalid instruction without 
 registering an invalid opcode exception handler. [SYS1936] 

 A trap 0007 is caused when a program called for a numeric coprocessor 
 instruction without a coprocessor in the system and without registering a 
 processor extension not available exception handler. [SYS1937] 

 A trap 0008 is caused when the processor detects an exception while processing 
 another exception. [SYS1938] 

 A trap 0009 is caused when a program runs a numeric coprocessor instruction 
 that tries to read or write past the end of the storage segment. [SYS1939] 

 A trap 000A is caused when a program attempts a task switch to an invalid task 
 state segment. [SYS1940] 

 A trap 000B is caused when a program attempts to reference a memory segment 
 that isn't present. [SYS1941] 

 A trap 000C is caused when a program attempts to push more data onto the stack 
 than it can hold, call too many subroutines, take more data off the stack than 
 was pushed onto it or return from more subroutines than were called. [SYS1942] 

 A trap 000D is caused by (but not limited to) a program referencing storage 
 outside the limit of the memory segment, referencing a storage segment that is 
 restricted to privileged code, referencing storage with a selector value of 
 zero, writing to read-only memory or a code segment, reading from an 
 execute-only code segment or loading an invalid value into a selector 
 register. [SYS1943] 

 A trap 000E is caused when a page being referenced is not present in memory, 
 the procedure referencing the page doesn't have enough privilege to access the 
 page or the address range was allocated but no storage is committed. 

 A trap 000F is reserved by Intel. It's not for our use. 

 A trap 0010 is caused when the processor detects an error from the 
 coprocessor, either by hardware or software. 


═══ 3.3. Internal Processing Errors ═══

Although not exactly a trap error, an Internal Processing Error (IPE) is 
associated with a SYS1915 error, which can be caused by the same conditions as 
a trap error and is consequently solved in the same manner. 

The key to determining the type of error is in interpreting the error message. 
A message "The system has detected an internal processing error at location 
##0160:FFF6FC01-000D:00015C01....." would tell you that you should follow the 
troubleshooting procedures for a trap 000D error. Locate the trap code by 
finding the three zeros followed by a letter or number (...01-000D:00...). In 
the IPE message, it will generally fall after the first cluster of digits. 
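
The "three zeros followed by a letter or number" rule can be expressed as a 
small pattern match. This is only a sketch based on the single sample message 
above: the function name trap_code_from_ipe is invented, and the assumption 
that the code sits between a "-" and a ":" comes from that one sample and may 
not hold for every IPE message. 

```python
import re

# Three zeros plus one hex digit, between a dash and a colon, as in
# "...01-000D:00...". The dash/colon framing is an assumption.
_TRAP_IN_IPE = re.compile(r"-(000[0-9A-Fa-f]):")


def trap_code_from_ipe(message):
    """Return the trap code (e.g. '000D') found in an IPE message, or None."""
    match = _TRAP_IN_IPE.search(message)
    return match.group(1).upper() if match else None
```

Note that the three-zeros rule, taken literally, cannot find a trap 0010, 
which has only two leading zeros. 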

OS/2 FixPak14 added to the information available for a Trap 0002 by giving 
additional information in the ErrCode field. This is explained in the 
following. 

PROBLEM CONCLUSION: This problem has been corrected in FixPak 14 for Warp. The 
ERRCD field will now contain the values from hardware port 0x61; on an EISA 
bus, port 0x461 is also read. The following codes represent the indicated 
actions needed: 

 0000  Software caused NMI 
 0001  RAM error, check memory (parity error) 
 0002  Adapter caused error (I/O channel check) 
 0003  Check bus mastering adapters, update adapter drivers. Also disable bus 
       mastering on failing adapters as a problem determination tool to figure 
       out which adapter is causing the failure. Contact the adapter vendor for 
       further assistance. (DMA timeout) 
 0004  A device driver or DOS application disabled interrupts too long. Contact 
       appropriate software vendor for updated software (Watchdog timeout). 
 0005  Contact application or device driver hardware vendor (software generated 
       NMI). 
 0006  see error code 0003. 
 0007  see error code 0004. 
 0008  see error code 0002. 
 0009  see error code 0001. 
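
Because codes 0006 to 0009 simply point back at earlier entries, a lookup 
needs one extra step to follow the "see error code" references. The sketch 
below shortens the wording of the list above; the names ERRCD_MEANINGS, 
ERRCD_ALIASES and explain_errcd are hypothetical. 

```python
# Shortened from the ERRCD list for Trap 0002 (FixPak 14 and later).
ERRCD_MEANINGS = {
    0x0000: "Software caused NMI",
    0x0001: "Parity error: RAM error, check memory",
    0x0002: "I/O channel check: adapter caused error",
    0x0003: "DMA timeout: check bus mastering adapters, update adapter drivers",
    0x0004: "Watchdog timeout: a driver or DOS application disabled interrupts too long",
    0x0005: "Software generated NMI: contact application or device driver vendor",
}

# "see error code NNNN" entries from the same list.
ERRCD_ALIASES = {0x0006: 0x0003, 0x0007: 0x0004, 0x0008: 0x0002, 0x0009: 0x0001}


def explain_errcd(errcd):
    """Return the meaning of a Trap 0002 ERRCD value, following aliases."""
    canonical = ERRCD_ALIASES.get(errcd, errcd)
    return ERRCD_MEANINGS.get(canonical, "ERRCD value not listed")
```

So explain_errcd(0x0009) gives the same parity error text as 
explain_errcd(0x0001). 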


═══ 3.3.1. Related Error Messages ═══

The following built-in OS/2 error messages can be queried to see additional 
help for the error codes: 

          For this ERRCD:     Type HELP <number> to get more help 
          0000                1944 
          0001                1945 
          0002                1946 
          0003                1947 
          0004                1948 
          0005                3140 
          0006                3141 
          0007                3142 
          0008                3143 
          0009                3144 

 (Example: Typing HELP 1945 from an OS/2 command prompt will provide additional 
 useful information about the failure that occurred.)
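
 The table above can also be kept as a small lookup that produces the command 
 to type. The help numbers are copied from the table; the names 
 HELP_NUMBER_FOR_ERRCD and help_command are invented for this sketch, which 
 only builds the command text and does not run it. 

```python
# ERRCD value -> number to pass to the OS/2 HELP command (from the table).
HELP_NUMBER_FOR_ERRCD = {
    0x0000: 1944,
    0x0001: 1945,
    0x0002: 1946,
    0x0003: 1947,
    0x0004: 1948,
    0x0005: 3140,
    0x0006: 3141,
    0x0007: 3142,
    0x0008: 3143,
    0x0009: 3144,
}


def help_command(errcd):
    """Return the HELP command for an ERRCD value, or None if unlisted."""
    number = HELP_NUMBER_FOR_ERRCD.get(errcd)
    return None if number is None else f"HELP {number}"
```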