Newsgroups: comp.unix.wizards
Path: sparky!uunet!elroy.jpl.nasa.gov!sdd.hp.com!caen!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: MP CPU-time phenomena
Message-ID: <1992Aug16.015606.4668@fcom.cc.utah.edu>
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <15250002@hprpcd.rose.hp.com> <1992Aug14.063450.25740@fcom.cc.utah.edu> <1992Aug15.172139.13498@ctr.columbia.edu>
Date: Sun, 16 Aug 92 01:56:06 GMT
Lines: 147

In article <1992Aug15.172139.13498@ctr.columbia.edu> kibirev@csa.bu.edu (oleg kibirev) writes:
[ ... Big, hairy discussion on benchmarking parallel processors using a
non-parallel algorithm, and the fallacies thereof ... ]
>Hmmm, won't this work?
>
>while(fork()>0);
>for(;;i++);

    If you mean "will it operate", the answer is "yes"... however:

    o   On most reasonable systems, the number of forks is not equal to
        the number of processors available; rather, there is a hard
        limit on the number of processes you can instantiate.

    o   On systems without a hard limit, you will fork until the process
        table is full or holds sizeof(process table) - 1 processes (if
        you are given the standard safety net so you can log in as root
        and kill the damn things), or, like a NeXT, your swap file will
        grow to consume the file system in which it lives, and the
        machine will crash.

    o   'i' is presumably a stack or volatile variable, and is not in
        shared memory.  As such, it is most likely located in a
        copy-on-write page (unless your box is copy-on-fork, which is
        worse).  This means you will not increment i once per process;
        you will increment your particular per-process copy of i.  This
        is relatively useless unless you can arrive at an aggregate
        value (see the sketch following this list).

    o   If 'i' is in shared memory, per-processor access to it will be
        semaphored.  Thus even though it will be incremented once per
        process running simultaneously on multiple processors, access to
        the memory location is serialized across processors, so you
        don't get multiple increments in the same set of clock cycles
        within which a single processor could increment a local
        variable.  Worse, due to the operation of an interprocessor
        mutex and cache coherency problems, you will actually increment
        the thing less frequently than a single processor incrementing a
        variable in local cache.

    o   These boxes have better things to do with their processors, even
        if your user code is the only thing running as a user app
        (swapping, to name one).  Your processes are not all guaranteed
        to get either the same number or the same size of CPU quanta.
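
    As an aside: if you actually do want an aggregate value, the counter
has to live in memory the children genuinely share, and then you run
straight into the serialization problem described above.  A minimal
sketch of the shared-counter pitfall, assuming a mmap() that supports
MAP_SHARED | MAP_ANONYMOUS (the constant names follow modern BSD/Linux
usage and are not from the original discussion):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NPROCS 4

    int main(void)
    {
        /* The counter lives in a MAP_SHARED page, so every child sees
         * the same storage instead of a private copy-on-write copy. */
        volatile long *i = mmap(NULL, sizeof(*i), PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (i == MAP_FAILED)
            return 1;
        *i = 0;

        for (int n = 0; n < NPROCS; n++) {
            if (fork() == 0) {
                /* Unsynchronized read-modify-write on shared memory:
                 * the children race and lose updates. */
                for (long k = 0; k < 1000000; k++)
                    (*i)++;
                _exit(0);
            }
        }
        while (wait(NULL) > 0)
            ;
        printf("aggregate i = %ld\n", *i);
        return 0;
    }

Run it and the aggregate will generally come up short of 4,000,000,
because the unsynchronized increments lose updates; add a lock and you
get the serialization penalty instead.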

    How does parallel processing ever do *anything* then?  Well, I'm
glad I asked for you (8-); that's really the question that needs to be
resolved:

    It does this by parallelizing things which are parallelizable.  It
may do so at various granularities (from high to low):

    o   Session granularity:  I run on one processor, and Phil runs on
        the other; by virtue of the fact that we each have our own
        processor, but are using one machine, it looks like we both have
        shared resources on separate machines (ASMP and/or things like
        NFS and lpd can make this seem to happen even if the processors
        aren't in the same case).

    o   Process granularity (parallel make and similar beasts), where
        one task is split into multiple processes, which can, but are
        not required to, run on separate processors simultaneously.

    o   Thread-of-execution granularity (POSIX threads/Solaris 2.0/
        AIX kernel preemption) where:

        Thread 0:   for( i=0; i<4; i++)
                        for( j=0; j<20; j++)
                            x[i][j] = foo;

        gets split into:

        Thread 0:   for( j=0; j<20; j++)
                        x[0][j] = foo;
        Thread 1:   for( j=0; j<20; j++)
                        x[1][j] = foo;
        Thread 2:   for( j=0; j<20; j++)
                        x[2][j] = foo;
        Thread 3:   for( j=0; j<20; j++)
                        x[3][j] = foo;

        and each thread is capable of being run by a separate processor
        (a threaded sketch of this split follows the list).  This relies
        moderately on compiler technology which is mostly there.  It
        falls into the category of loop-unrolling-style optimizations.

    o   Code-reordering granularity.  This requires extensive knowledge
        of the code (either by the programmer, in terms of hints to the
        compiler, or in terms of syntactically parallel languages), and
        extreme knowledge by the compiler of the underlying scheduling
        mechanisms (with hooks for manipulation).  This is sometimes
        called a shared-context, process-centric model.  (There are some
        *very* good papers on this on wuarchive.)
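
    To make the thread-granularity split concrete, here is a minimal
sketch using POSIX threads (the pthread API stands in for the thread
packages named above; 'foo' is just a placeholder value):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NCOLS    20

    int x[NTHREADS][NCOLS];
    int foo = 1;                /* stand-in for whatever foo is */

    /* Each thread fills exactly one row: the per-thread body from the
     * split shown above. */
    void *fill_row(void *arg)
    {
        int row = (int)(long)arg;
        for (int j = 0; j < NCOLS; j++)
            x[row][j] = foo;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, fill_row, (void *)(long)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        printf("x[3][19] = %d\n", x[3][19]);
        return 0;
    }

Whether the four threads actually land on four processors is up to the
kernel scheduler; the point is only that the rows are independent, so
they can.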

    I guess the whole point of where this is going is "can I add a whole
bunch of PC's (or their moral equivalent) together and get supercomputer
performance?"  The answer is yes, but only for parallel-dissoluble
problems.

    The original poster was trying to determine how much better a
parallel machine was than a non-parallel machine whose capability is
equivalent to a single one of the parallel machine's processors.  He was
unpleasantly surprised that for(;;) i++; didn't increment the variable n
times as fast for n processors (for reasons I hope are clear by now).

    The suggestion that he build a model of the parallel behaviour based
on the parallelizability of his intended application is not
unreasonable.  Most client/server databases (Oracle, for instance)
benefit greatly from session and process granularity on something like a
Sequent.  The added benefits of SMP vs. ASMP in this case are dynamic
load balancing and distribution of "interrupt" style processing within
the kernel (an ASMP machine, for instance, must direct all I/O through a
single processor as a serialization mechanism).

    I won't argue the (de)merits of MP computing here; even though there
are problems which can't be dissolved at any of the granularities
currently supported, there are other problems that lend themselves to
partial parallelization, or even total parallelization (fluidic
modelling, to name one).  Whichever case the final intended application
of the original poster falls into, predicting its behaviour certainly
can't be done without a model (i.e., a test program) that dissolves to
the same granularity as the final application.

    for(;;) i++; isn't rationally divisible (i.e., dividing it may in
fact force it to take longer) and thus models a non-parallelizable
application.

    while(fork()>0); for(;;i++); is rationally divisible at process
granularity, provided it isn't the same i in each child; one way to
aggregate the per-child counts is sketched below.
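
    A sketch of that rationally divisible version: each child counts in
its own private i for a fixed interval, then reports its count over a
pipe so the parent can arrive at the aggregate (the one-second alarm and
the four children are arbitrary choices for illustration):

    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NPROCS 4

    static volatile sig_atomic_t done;

    static void stop(int sig) { (void)sig; done = 1; }

    int main(void)
    {
        int fd[2];
        if (pipe(fd) < 0)
            return 1;

        for (int n = 0; n < NPROCS; n++) {
            if (fork() == 0) {
                /* Each child increments its own private copy of i:
                 * no sharing, no serialization. */
                long i = 0;
                signal(SIGALRM, stop);
                alarm(1);                       /* count for one second */
                while (!done)
                    i++;
                write(fd[1], &i, sizeof(i));    /* report the count */
                _exit(0);
            }
        }

        long total = 0, i;
        for (int n = 0; n < NPROCS; n++) {
            read(fd[0], &i, sizeof(i));         /* aggregate */
            total += i;
        }
        while (wait(NULL) > 0)
            ;
        printf("aggregate count: %ld\n", total);
        return 0;
    }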

    If the intended application does not fit either category, a new
model is needed; if it *does* fit either category, then he's probably
paying a premium for resource sharing between what are in effect
multiple separate machines in the same box.  Thus either the test model
was broken or the expectations of the user were.

                                        Terry Lambert
                                        terry_lambert@gateway.novell.com
                                        terry@icarus.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.
--
-------------------------------------------------------------------------------
terry@icarus.weber.edu
"I have an 8 user poetic license" - me
-------------------------------------------------------------------------------