NetNews Usenet Archive 1992 #20




Internet Message Format | 1992-09-13 | 5.6 KB

Path: sparky!uunet!know!hri.com!noc.near.net!news.Brown.EDU!qt.cs.utexas.edu!cs.utexas.edu!usc!wupost!darwin.sura.net!jvnc.net!rutgers!modus!gear!cadlab!martelli
From: martelli@cadlab.sublink.org (Alex Martelli)
Newsgroups: comp.unix.aix
Subject: Re: AIX malloc and fault tolerance
Message-ID: <1992Sep11.101004.1401@cadlab.sublink.org>
Date: 11 Sep 92 10:10:04 GMT
References: <1992Sep3.135156.9166@medtron.medtronic.com> <1258@curly.appmag.com>
Organization: CAD.LAB S.p.A., Bologna, Italia
Lines: 94

pa@curly.appmag.com (Pierre Asselin) writes:
:In <1992Sep3.135156.9166@medtron.medtronic.com>
:  sh0001@israel (Scott Hansohn) writes:
:
:[He bumped into the malloc virtual allocation nonsense again.
: I still get mad just thinking about it.]

DITTO!  And the excuses we get about it look just like that, EXCUSES!

:Some arguments to the effect that vapour-memory was a good thing were:
:  -Lets you use gigantic sparse arrays.
:  -Lets vendors ship Fortran binaries with static arrays dimensioned
:   to maximum size, and yet have them run on small machines for small
:   problems that use only part of the arrays.
:
:I'm skeptical.  Sparse arrays at 4kB/page?  As for the Fortran bit, it
:only makes sense on machines dedicated to a single application.  That
:sure isn't the way we use ours.

Dedicating a WS to a single (set of) application IS the typical way that
cad.lab customers would use their machines, and we are heavy Fortran
users, and the malloc()-but-not-really idea STILL stinks.  We are selling
INDUSTRIAL-STRENGTH applications that will be used for CRUCIAL PRODUCTION
WORK; it's COMPLETELY UNACCEPTABLE for our customers to lose data because
the application dumps abruptly!!!

So our apps are full of error-checks.  In particular, we do NOT place
data that will grow for large problems inside Fortran arrays; they
reside, instead, in areas which are dynamically allocated by an
underlying library written in C, and accessed via functions or
subroutines by the Fortran portions.
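A minimal sketch of what such a C allocation layer might look like,
assuming a small fixed handle table so that Fortran callers only ever
see integers.  All names here (area_alloc, area_grow, area_free,
area_ptr, AREA_OK, AREA_NOMEM, MAX_AREAS) are hypothetical and do not
come from the CAD.LAB code; Fortran-to-C naming and argument-passing
conventions vary by compiler and are left out.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical status codes returned to the caller. */
enum { AREA_OK = 0, AREA_NOMEM = 1 };

/* Hypothetical fixed-size handle table: callers see only integers. */
#define MAX_AREAS 64
static void  *areas[MAX_AREAS];
static size_t area_sizes[MAX_AREAS];

/* Allocate nbytes; on success store a handle in *handle. */
int area_alloc(size_t nbytes, int *handle)
{
    int i;
    for (i = 0; i < MAX_AREAS; i++) {
        if (areas[i] == NULL) {
            void *p = malloc(nbytes ? nbytes : 1);
            if (p == NULL)
                return AREA_NOMEM;      /* report, do not abort */
            areas[i] = p;
            area_sizes[i] = nbytes;
            *handle = i;
            return AREA_OK;
        }
    }
    return AREA_NOMEM;                  /* handle table is full */
}

/* Resize an area; on failure the old contents stay valid. */
int area_grow(int handle, size_t nbytes)
{
    void *p;
    if (handle < 0 || handle >= MAX_AREAS || areas[handle] == NULL)
        return AREA_NOMEM;
    p = realloc(areas[handle], nbytes ? nbytes : 1);
    if (p == NULL)
        return AREA_NOMEM;              /* old block is untouched */
    areas[handle] = p;
    area_sizes[handle] = nbytes;
    return AREA_OK;
}

/* Return the address behind a handle, or NULL if it is not in use. */
void *area_ptr(int handle)
{
    if (handle < 0 || handle >= MAX_AREAS)
        return NULL;
    return areas[handle];
}

/* Release an area; unknown or unused handles are ignored. */
void area_free(int handle)
{
    if (handle < 0 || handle >= MAX_AREAS)
        return;
    free(areas[handle]);
    areas[handle] = NULL;
    area_sizes[handle] = 0;
}
```

Every routine reports failure through its return value, so the caller
can tell the user and keep running.  This only helps where a NULL from
malloc() is how memory exhaustion shows up; where malloc() hands out
memory that may not be backed, the check never fires.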
On machines where malloc() semantics make sense, the C routine will
return an error indicator to the Fortran portion if it's unable to get
the memory requested; in this case, the application communicates to the
interactive user that the requested operation cannot be completed due to
running out of virtual memory, but the app is still alive and the user
can save hir work so far, and restart from there, presumably after having
swapspace reconfigured.  We've been particularly careful that nothing in
the save-to-disk subsystem NEEDS to allocate extra memory, so that the
saving will work even in crucial memory-low situations; we even had to
recode the output-to-file portions as C subroutines running over
low-level system calls, as we found with surprise that Fortran I/O, and C
stdio, on some platforms, may need a malloc() to succeed and will die if
it fails (and, yes, our applications ARE and WILL REMAIN extremely
portable code).

All this care, of course, is for naught on the IBM R/6000 (thankfully we
don't presently run on DG Aviion, where malloc() is reportedly similarly
broken).

And no, we can't just set "limit datasize" appropriately, because it
depends on what the user is doing exactly: sometimes the 3D modeler will
be running alone, other times it will be scheduled together with the 2D
drafter and/or the surface renderer and/or the relational database and/or
the tool which builds programs for numerically controlled tools and/or...
each of these applications is written to be able to run alone OR
communicate with its brethren.

We've tried the tricks IBM suggested to stop our application from dying
in unexpected places, but what happens then is that OTHER processes die
-- and the first to go is typically the X server (a memory hog, I
guess!), so the user cannot communicate with the apps to ask to save...
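The allocation-free save path mentioned above could be sketched roughly
as follows, assuming a POSIX system.  The names write_all and save_block
are hypothetical, not the actual CAD.LAB routines; the point is only
that open(), write() and close() are used directly, with no stdio
buffers and no heap.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes, retrying after short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, try again */
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical entry point: dump one in-memory block to a file.
   Returns 0 on success, -1 on any failure.  Never calls malloc(). */
int save_block(const char *path, const void *data, size_t len)
{
    int rc;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    rc = write_all(fd, (const char *)data, len);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

The caller passes memory it already owns, so the routine needs nothing
beyond a file descriptor and a few stack variables.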
And NO, we CANNOT just do the saving from the SIGDANGER handler as a
safety net; the handler can be basically entered from anywhere in the
application, including "critical sections" where the data structures are
in transition and inconsistent (and NO, we CANNOT protect the critical
sections by turning off signals there, or we'll die for lack of SIGDANGER
handling).

Yes, I know that a thousand clever tricks spring to mind to work around
one or the other of these problems, but believe me: we must have tried at
least 900 of them and they don't work.  We've spent more time and effort
on battling this malloc() idiocy than on any other single porting problem
EVER (and with the huge list of platforms we've supported over the years
we've had quite SOME such problems, believe you me!)!!!  Most porting
problems come from bugs in the target system, some from bugs in our code,
but here we're fighting against something BROKEN AS DESIGNED --
***HORRIBLY*** BROKEN.  I would say it's been half the cost of the IBM
R/6000 port, if it weren't for the fact that the monstrously slow linker
(thankfully remedied in 3.2, but this port was started right at system
announcement...) and the bugs in the early X have driven that cost way
up.

Anyway, at the end, we've given up and just document to our customers how
AND WHY their work may go up in smoke on IBM R/6000 and not on DEC,
Olivetti, Sun, HP, Sony or other platforms.  If IBM ever gives us a
malloc() WHICH WORKS, we'll be glad to use it.  And I hope that
periodically rekindled flames about it will do some good -- if we could
get together with everybody who's suffered for this and blackmail IBM
into it, the world would become a better place in at least this small
way...
-- 
Email: martelli@cadlab.sublink.org                   Phone: ++39 (51) 6130360
CAD.LAB s.p.a., v. Ronzani 7/29, Casalecchio, Italia   Fax: ++39 (51) 6130294