NetNews Usenet Archive 1992 #16

Text File | 1992-07-23 | 2.4 KB | 47 lines

Newsgroups: comp.sys.apollo
Path: sparky!uunet!utcsri!helios.physics.utoronto.ca!alchemy.chem.utoronto.ca!system
From: system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson))
Subject: Re: Problems with 10.4
Message-ID: <1992Jul23.144508.4401@alchemy.chem.utoronto.ca>
Organization: University of Toronto Chemistry Department
References: <14kv48INNh5r@agate.berkeley.edu>
Date: Thu, 23 Jul 1992 14:45:08 GMT
Lines: 36

In article <14kv48INNh5r@agate.berkeley.edu> alanc@ocf.berkeley.edu writes:
>We recently finished upgrading our cluster of 16 machines (2 DN 4500's + 14
>3500's) to 10.4 - and since then we've been having several problems:
>
>- csh/tcsh will go into a "can't find any commands" mood, from which the only
>  way to do anything is exec /bin/ksh. (It seems to be linked to the
>  tty's as some tty's have the problem & others don't...rebuilding the
>  tty's "cures" the problem for a few days, but then it comes back)

You don't happen to be running NFS on these systems, do you?
Since I moved our home directories to an HP-UX system, our DN2500
systems have become so flaky when doing basic things like 'rn' that
they are useless; the problem seems to be that they cannot reliably
spawn subshells (since my .cshrc file has to be accessed by NFS).

>- processes aren't dying properly. For example a user telnet's in, and
>  starts reading news. The telnet will die, but it's child csh and the
>  trn will keep going. In fact, the last process in the chain will
>  start using huge amounts of processor time. The only solution we've
>  found for this so far is to check the process table every 10 minutes
>  and kill -HUP every process who's unix_PPID is 1 (and is not owned
>  by root/daemon/user/etc.) kill(-9, -1) is also failing at times.

This one is caused by improper handling of signals in telnetd and
rlogind (and ftpd and ...) - this was changed at SR10.4 so that the
members of a process group (which basically will correspond to a
login session) are not signalled when the parent of the group dies.
The child processes then start eating CPU (reason unknown).
I suggest you call HP if you are on support; we called this in back
in April (Call # A1975057, A1975055, Escalation issue EPIC 1801).
We are testing a fixed rlogind for the DN10000, but there are still
no fixes for telnetd or the M680x0 nodes after more than 2 months.
-- 
What are the chances that any computer system will ever "work" properly?
... and Slim just left town.
-*- Mike Peterson, SysAdmin, U/Toronto Chemistry
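
The interim workaround quoted in the post (scan the process table every 10 minutes and send HUP to anything whose parent is PID 1 and whose owner is not a system account) can be sketched roughly as follows. This is not from the original post: the function names, the exempt-user set, and the use of POSIX `ps -o` output are all assumptions made for illustration, and Domain/OS SR10.4 tools may behave differently.

```python
import os
import signal
import subprocess

# Hypothetical exemption list; the post only says "root/daemon/user/etc."
EXEMPT_USERS = {"root", "daemon"}


def find_orphans(ps_output, exempt=EXEMPT_USERS):
    """Return PIDs whose parent is PID 1 and whose owner is not exempt.

    Expects one "USER PID PPID" triple per line, with no header line.
    """
    orphans = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        user, pid, ppid = fields
        if not (pid.isdigit() and ppid.isdigit()):
            continue  # skip anything that is not a numeric PID pair
        if int(ppid) == 1 and user not in exempt:
            orphans.append(int(pid))
    return orphans


def hangup_orphans():
    """Send SIGHUP to each orphan found in the current process table."""
    # An empty header ("user=") suppresses the header line on POSIX ps.
    out = subprocess.run(
        ["ps", "-e", "-o", "user=", "-o", "pid=", "-o", "ppid="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid in find_orphans(out):
        try:
            os.kill(pid, signal.SIGHUP)
        except (ProcessLookupError, PermissionError):
            pass  # process already exited, or is not ours to signal
```

Something like `hangup_orphans()` would be run from cron at the 10-minute interval mentioned in the post. Note that legitimate daemons also have PID 1 as their parent, so the exempt list has to cover every account that runs them before this is safe to use.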