- Path: sparky!uunet!know!hri.com!noc.near.net!news.Brown.EDU!qt.cs.utexas.edu!cs.utexas.edu!usc!wupost!darwin.sura.net!jvnc.net!rutgers!modus!gear!cadlab!martelli
- From: martelli@cadlab.sublink.org (Alex Martelli)
- Newsgroups: comp.unix.aix
- Subject: Re: AIX malloc and fault tolerance
- Message-ID: <1992Sep11.101004.1401@cadlab.sublink.org>
- Date: 11 Sep 92 10:10:04 GMT
- References: <1992Sep3.135156.9166@medtron.medtronic.com> <1258@curly.appmag.com>
- Organization: CAD.LAB S.p.A., Bologna, Italia
- Lines: 94
-
- pa@curly.appmag.com (Pierre Asselin) writes:
-
- :In <1992Sep3.135156.9166@medtron.medtronic.com>
- : sh0001@israel (Scott Hansohn) writes:
- :
- :[He bumped into the malloc virtual allocation nonsense again.
- : I still get mad just thinking about it.]
-
- DITTO! And the excuses we get about it look just like that, EXCUSES!
-
- :Some arguments to the effect that vapour-memory was a good thing were:
- : -Lets you use gigantic sparse arrays.
- : -Lets vendors ship Fortran binaries with static arrays dimensioned
- : to maximum size, and yet have them run on small machines for small
- : problems that use only part of the arrays.
- :
- :I'm skeptical. Sparse arrays at 4kB/page? As for the Fortran bit, it
- :only makes sense on machines dedicated to a single application. That
- :sure isn't the way we use ours.
-
- Dedicating a workstation to a single application (or set of applications)
- IS the typical way that cad.lab customers would use their machines, and we
- are heavy Fortran users, and the malloc()-but-not-really idea STILL stinks.
- We are selling INDUSTRIAL-STRENGTH applications that will be used for CRUCIAL
- PRODUCTION WORK; it's COMPLETELY UNACCEPTABLE for our customers to lose
- data because the application dumps abruptly!!! So our apps are full of
- error-checks.
-
- In particular, we do NOT place data that will grow for large problems
- inside Fortran arrays; they reside, instead, in areas which are
- dynamically allocated by an underlying library written in C, and
- accessed via functions or subroutines by the Fortran portions. On
- machines where malloc() semantics make sense, the C routine will return
- an error indicator to the Fortran portion if it's unable to get the
- memory requested; in this case, the application communicates to the
- interactive user that the requested operation cannot be completed due to
- running out of virtual memory, but the app is still alive and the user
- can save hir work so far, and restart from there, presumably after having
- the swap space reconfigured.
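-
- A minimal sketch of such a C layer, for illustration only. It assumes a
- malloc() that really does return NULL when memory is exhausted. Every name
- here (area_alloc, area_free, areaal_, the AREA_* codes) is hypothetical,
- and the trailing underscore with by-reference arguments is just one common
- Fortran-to-C calling convention, not necessarily the one any given
- compiler uses.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical status codes handed back to the Fortran side. */
#define AREA_OK     0
#define AREA_NOMEM  1
#define AREA_BADARG 2

#define MAX_AREAS 64                 /* arbitrary limit for the sketch */

static void *areas[MAX_AREAS];       /* handle -> allocated block */

/* Allocate nbytes; on success store a handle (>= 0) in *handle.
 * On failure return an error code instead of aborting. */
int area_alloc(size_t nbytes, int *handle)
{
    int i;
    void *p;

    if (handle == NULL || nbytes == 0)
        return AREA_BADARG;

    for (i = 0; i < MAX_AREAS; i++)
        if (areas[i] == NULL)
            break;
    if (i == MAX_AREAS)
        return AREA_NOMEM;           /* handle table is full */

    p = malloc(nbytes);
    if (p == NULL)
        return AREA_NOMEM;           /* caller tells the user; app stays alive */

    areas[i] = p;
    *handle = i;
    return AREA_OK;
}

/* Release an area; invalid or unused handles are ignored. */
void area_free(int handle)
{
    if (handle >= 0 && handle < MAX_AREAS) {
        free(areas[handle]);
        areas[handle] = NULL;
    }
}

/* Fortran-callable shim (assumed convention: CALL AREAAL(N, IH, ISTAT)). */
void areaal_(const int *nbytes, int *handle, int *status)
{
    if (nbytes == NULL || status == NULL) {
        if (status != NULL)
            *status = AREA_BADARG;
        return;
    }
    if (*nbytes <= 0) {
        *status = AREA_BADARG;
        return;
    }
    *status = area_alloc((size_t)*nbytes, handle);
}
```

- The Fortran caller would test the status argument after each call and, on
- a nonzero value, report the failure to the user rather than stop.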
-
- We've been particularly careful that nothing in the save-to-disk
- subsystem NEEDS to allocate extra memory, so that the saving will work
- even in critical low-memory situations; we even had to recode the
- output-to-file portions as C subroutines running over low-level
- system calls, as we found to our surprise that Fortran I/O, and C stdio,
- on some platforms, may need a malloc() to succeed and will die if it
- fails (and, yes, our applications ARE and WILL REMAIN extremely portable
- code).
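-
- As an illustration of output that avoids stdio buffering altogether, here
- is a sketch built directly on the POSIX open()/write()/close() calls. The
- function names (save_model, write_all) and the file mode are assumptions
- made for the example; the real save routines are not shown in this post.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes to fd, retrying after short writes and EINTR.
 * Uses no heap memory. Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted: try again */
            return -1;
        }
        if (n == 0)
            return -1;               /* no progress: treat as failure */
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Save a caller-owned block to path without calling malloc() or stdio.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int save_model(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;
    if (write_all(fd, (const char *)data, len) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

- The caller passes memory it already owns, so nothing on this path depends
- on an allocation succeeding at save time.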
-
- All this care, of course, is for naught on the IBM R/6000 (thankfully we
- don't presently run on DG Aviion, where malloc() is reportedly similarly
- broken). And no, we can't just set "limit datasize" appropriately,
- because it depends on what the user is doing exactly: sometimes the 3D
- modeler will be running alone, other times it will be scheduled together
- with the 2D drafter and/or the surface renderer and/or the relational
- database and/or the tool which builds programs for numerically
- controlled tools and/or... each of these applications is written to be
- able to run alone OR communicate with its brethren.
-
- We've tried the tricks IBM suggested to stop our application from dying
- in unexpected places, but what happens then is that OTHER processes
- die -- and the first to go is typically the X server (a memory hog, I
- guess!), so the user cannot communicate with the apps to ask them to save...
- and NO, we CANNOT just do the saving from the SIGDANGER handler as a
- safety net; the handler can be entered from basically anywhere in the
- application, including "critical sections" where the data structures
- are in transition and inconsistent (and NO, we CANNOT protect the
- critical sections by turning off signals there, or we'll die for
- lack of SIGDANGER handling).
-
- Yes, I know that a thousand clever tricks spring to mind to work around
- one or the other of these problems, but believe me: we must have tried
- at least 900 of them and they don't work. We've spent more time and
- effort on battling this malloc() idiocy than on any other single porting
- problem EVER (and with the huge list of platforms we've supported over
- the years we've had quite SOME such problems, believe you me!)!!! Most
- porting problems come from bugs in the target system, some from bugs in
- our code, but here we're fighting against something BROKEN AS DESIGNED
- -- ***HORRIBLY*** BROKEN. I would say it's been half the cost of the
- IBM R/6000 port, if it weren't for the fact that the monstrously slow
- linker (thankfully remedied in 3.2, but this port was started right at
- system announcement...) and the bugs in the early X have driven that
- cost way up. Anyway, in the end, we've given up and just document to
- our customers how AND WHY their work may go up in smoke on IBM R/6000
- and not on DEC, Olivetti, Sun, HP, Sony or other platforms.
-
- If IBM ever gives us a malloc() WHICH WORKS, we'll be glad to use it.
- And I hope that periodically rekindled flames about it will do some
- good -- if we could get together with everybody who's suffered from
- this and blackmail IBM into it, the world would become a better place
- in at least this small way...
- --
- Email: martelli@cadlab.sublink.org Phone: ++39 (51) 6130360
- CAD.LAB s.p.a., v. Ronzani 7/29, Casalecchio, Italia Fax: ++39 (51) 6130294
-