- Subject: Novell E-Net Shell Loading Problem (OS Problem)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
- Technical Description and Resolution
- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
- !!! NOTE: This problem has been fixed in later releases !!!
- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-
- NOTE: This problem is actually an operating system bug. The
- probability that this problem will ever be encountered in the field
- is extremely low. We encountered the problem only because we were
- extensively testing every possible configuration. It is somewhat
- unlikely that the configuration on which we found the problem would
- actually be used by NetWare users.
-
- PROBLEM: Under specific circumstances, the Novell E-Net shell does not
- properly load about 15% of the time. This occurs only under
- the following conditions. The file server is an IBM PC XT
- running NetWare 86 v2.0a with a Novell E-Net interface card.
- The network consists of at least 3 workstations, all with
- Novell E-Net interface cards. All but one of the
- workstations are logged in and are running disk-intensive
- programs. The last workstation then attempts to load the
- shell, but about 15% of the time it gets the following error
- message before getting fully connected to the file server:
-
- Network Error on Server SERVERNAME:Error reading from
- network. Abort or Retry?
-
- Attempting to retry only causes the error to be redisplayed.
- The workstation must then be rebooted before attempting to
- reload the shell. Once the shell is successfully loaded, the
- system will run indefinitely without any errors.
-
- CAUSE: This error is caused by the way that NetWare buffers incoming
- packets, coupled with the very high network speed of E-Net
- and the extreme slowness of the IBM PC XT hard disk.
-
- The problem occurs when the workstation that is attempting
- to load the shell gets out of synchronization with the file
- server. This occurs just as the shell is attempting to
- request initial service from the server. This is the most
- critical point of communication between the workstation and
- the file server. From then on until the workstation is
- rebooted, communication between the workstation and the file
- server is essentially deadlocked. The workstation keeps
- sending requests to the server, but the server keeps ignoring
- them because their sequence numbers do not match the one
- it is expecting.
-
- This error condition has nothing to do with the Novell E-Net
- boards except that the condition was aggravated by the
- network's high speed. Potentially, this error could occur
- with any very high speed network running on a file server
- that has an extremely slow hard disk.
-
-
- TECHNICAL
- EXPLANATION: When the shell is first loaded on a workstation, it sends an
- "Allocate Slot Request Packet" to the file server requesting
- it to open a slot for the workstation. The handling of this
- packet is critical because several parameters are initialized
- at this time. If the file server is busy with other
- processing when incoming packets arrive, they are placed in
- a LIFO buffer known as the Turbo Receive Buffer. The
- "Allocate Slot Request Packet" is placed in the buffer along
- with other packets until the operating system gets around to
- processing them. If the file server is extremely busy, the
- workstation shell times out. Assuming that the request packet
- may have been lost, it sends a retry "Allocate Slot Request
- Packet". This retry packet is also received and stored in the
- Turbo Receive Buffer.
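-
- The reordering can be illustrated with a short Python sketch. The
- list below is only a stand-in for the Turbo Receive Buffer; the
- packet labels and variable names are assumptions made for
- illustration and are not taken from NetWare.
-
```python
# Hypothetical model of a LIFO receive buffer (illustrative names only).
# A Python list used as a stack: append() pushes, pop() takes the newest.
turbo_receive_buffer = []

# Packets arrive while the server is busy and are stacked in arrival order.
turbo_receive_buffer.append("allocate-slot (original)")
turbo_receive_buffer.append("disk request from another workstation")
turbo_receive_buffer.append("allocate-slot (retry)")

# When the server is free again, it drains the buffer newest-first.
processing_order = []
while turbo_receive_buffer:
    processing_order.append(turbo_receive_buffer.pop())

for position, packet in enumerate(processing_order, start=1):
    print(position, packet)
```
-
- In this sketch the retry is handled first and the original
- request last, which is the ordering the rest of this explanation
- depends on.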
-
- Now the operating system finally completes its other tasks and
- starts to process the incoming packets. Because the buffer
- is a LIFO, the retry "Allocate Slot Request Packet" is
- processed first. The slot parameters are initialized,
- including the packet sequence number. The packet sequence
- number is the number of the next packet in the sequence of
- communications between the file server and that specific
- workstation. A reply is generated and sent back to the
- workstation, and the packet sequence number stored in the
- file server is incremented. The workstation then sends its
- next request packet, again incrementing the packet sequence
- number. The file server buffers the incoming packet,
- eventually processes it, and sends back a reply packet. Again
- the packet sequence number is incremented.
-
- Finally, the file server gets around to processing the
- original "Allocate Slot Request Packet" that had been buried
- at the bottom of the stack. This packet causes the file
- server to reinitialize the slot parameters for that particular
- workstation, including the packet sequence number. The file
- server then sends out a reply for this packet, which the
- workstation ignores because it carries the wrong sequence
- number.
-
- Now the file server will no longer accept and process packets
- from the workstation because the sequence numbers are out of
- synchronization. The workstation is attempting to send valid
- packets with valid sequence numbers to the file server, but
- because the file server's sequence number counter has been
- reinitialized, none of these packets are recognized and they
- are discarded. For example, the file server is expecting
- packet number two (since the sequence number was
- reinitialized), but the workstation is attempting to send
- packet number four or higher. The workstation can retry
- sending the new packets to the file server forever, and the
- server will never process them. Thus the communication
- between the file server and the workstation is deadlocked
- until the workstation is rebooted and the shell sends a new
- "Allocate Slot Request Packet."
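-
- The whole sequence of events can be reproduced in a small Python
- simulation. This is a simplified sketch and not NetWare code: the
- FileServer class, the Packet fields, the choice of 1 as the
- initial sequence number, and the two requests serviced before the
- buried packet surfaces are all assumptions chosen to match the
- "expecting two, receiving four" example above.
-
```python
from dataclasses import dataclass

ALLOC = "allocate-slot"
REQUEST = "request"


@dataclass
class Packet:
    kind: str
    seq: int


class FileServer:
    """Hypothetical server with a LIFO receive buffer and one slot."""

    def __init__(self):
        self.buffer = []          # LIFO stack of pending packets
        self.expected_seq = None  # next sequence number the slot accepts

    def receive(self, packet):
        self.buffer.append(packet)

    def process_one(self):
        """Process the newest buffered packet; True means it was accepted."""
        packet = self.buffer.pop()
        if packet.kind == ALLOC:
            # (Re)initialize the slot, then count the reply that is sent.
            self.expected_seq = packet.seq + 1
            return True
        if packet.seq == self.expected_seq:
            self.expected_seq += 1
            return True
        return False  # wrong sequence number: packet is discarded


server = FileServer()
server.receive(Packet(ALLOC, 1))  # original request, sent while server is busy
server.receive(Packet(ALLOC, 1))  # retry sent after the shell times out

server.process_one()              # LIFO: the retry is serviced first
workstation_seq = 2               # workstation got its reply and moves on

# Two ordinary requests are serviced while the original stays buried.
for _ in range(2):
    server.receive(Packet(REQUEST, workstation_seq))
    if server.process_one():
        workstation_seq += 1

server.process_one()              # buried original finally resets the slot

# Every further attempt by the workstation is now discarded.
retry_results = []
for _ in range(3):
    server.receive(Packet(REQUEST, workstation_seq))
    retry_results.append(server.process_one())

print("server expects", server.expected_seq)
print("workstation sends", workstation_seq)
print("retries accepted:", retry_results)
```
-
- Under these assumptions the server ends up expecting packet two
- while the workstation keeps sending packet four, so no retry is
- ever accepted.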
-
- If the shell successfully loads and connects to the file
- server, it means that the "Allocate Slot Request Packet" has
- been serviced properly without duplication. The system will
- then continue to operate correctly indefinitely. Packets may
- still be processed out of order because of the LIFO
- reordering, but this has no effect on the processing because
- the sequence numbers remain synchronized between the file
- server and the workstation.
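-
- A brief Python sketch of why reordering is harmless once the slot
- is set up. It assumes, for illustration only, that the server
- keeps one sequence counter per workstation and that each
- workstation has a single request outstanding at a time; the
- station names and numbers are made up.
-
```python
# Hypothetical per-workstation sequence counters (one per slot).
expected_seq = {"WS1": 5, "WS2": 9, "WS3": 2}

# One outstanding request per workstation, listed in arrival order.
arrivals = [("WS1", 5), ("WS2", 9), ("WS3", 2)]

# The LIFO buffer hands the packets back in reverse arrival order.
lifo_buffer = list(arrivals)
accepted = []
while lifo_buffer:
    station, seq = lifo_buffer.pop()
    if seq == expected_seq[station]:
        expected_seq[station] += 1   # reply sent, counter advances
        accepted.append(station)

print("accepted in order:", accepted)
print("counters now:", expected_seq)
```
-
- Each packet is checked only against its own workstation's
- counter, so the reversed processing order still accepts all
- three requests in this sketch.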
-
- SOLUTION: Since the problem described above is directly linked to the
- operating system, the best way to eliminate the problem would
- be to modify the operating system code. This could be an
- update consideration in future releases of NetWare.
-
- Since the error is noncritical and recoverable, no immediate
- solution is being sought. This decision is based upon the
- following facts. The error occurs only under a very specific
- set of rare circumstances. The error only occurs about 15%
- of the time under these circumstances and it is easily
- recovered from by rebooting and reloading the shell. Once the
- shell is successfully loaded, no further problems are
- experienced.
-
- TIC: date=3-30-87, ref#=031887.008, status=RESOLVED
-