Newsgroups: comp.unix.wizards
Path: sparky!uunet!elroy.jpl.nasa.gov!sdd.hp.com!caen!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: MP CPU-time phenomena
Message-ID: <1992Aug16.015606.4668@fcom.cc.utah.edu>
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <15250002@hprpcd.rose.hp.com> <1992Aug14.063450.25740@fcom.cc.utah.edu> <1992Aug15.172139.13498@ctr.columbia.edu>
Date: Sun, 16 Aug 92 01:56:06 GMT
Lines: 147

In article <1992Aug15.172139.13498@ctr.columbia.edu> kibirev@csa.bu.edu (oleg kibirev) writes:
[ ... Big, hairy discussion on benchmarking parallel processors using a
non-parallel algorithm, and the fallacies thereof ... ]
>Hmmm, won't this work?
>
>while(fork()>0);
>for(;;i++);

    If you mean "will it operate", the answer is "yes"... however:

    o   On most reasonable systems, the number of forks is not equal to
        the number of processors available; rather, there is a hard
        limit on the number of processes you can instantiate.

    o   On systems without a hard limit, you will fork until the process
        table is full or holds sizeof(process table) - 1 processes (if
        you are given the standard safety net so you can log in as root
        and kill the damn things), or, like a NeXT, your swap file will
        grow to consume the file system in which it lives, and the
        machine will crash.

    o   'i' is presumably a stack or volatile variable, and is not in
        shared memory.  As such, it is most likely located in a
        copy-on-write page (unless your box is copy-on-fork, which is
        worse).  This means you will not increment i once per process;
        you will increment your particular per-process copy of i.  This
        is relatively useless unless you can arrive at an aggregate
        value (see the sketch following this list).

    o   If 'i' is in shared memory, per-processor access to it will be
        semaphored.  Thus even though it will be incremented once per
        process running simultaneously on multiple processors, access to
        the memory location is serialized across processors, so you
        don't get multiple increments in the same set of clock cycles
        within which a single processor could increment a local
        variable.  Worse, due to the operation of an interprocessor
        mutex and cache coherency problems, you will actually increment
        the thing less frequently than a single processor incrementing a
        variable in local cache.

    o   These boxes have better things to do with their processors, even
        if your user code is the only thing running as a user app
        (swapping, to name one).  Your processes are not all guaranteed
        to get either the same number or the same size of CPU quanta.
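
    As an aside: if you actually do want an aggregate value, the counter
has to live in memory the children genuinely share, and then you run
straight into the serialization problem described above.  A minimal
sketch of the shared-counter pitfall, assuming a mmap() that supports
MAP_SHARED | MAP_ANONYMOUS (the constant names follow modern BSD/Linux
usage and are not from the original discussion):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NPROCS 4

    int main(void)
    {
        /* The counter lives in a MAP_SHARED page, so every child sees
         * the same storage instead of a private copy-on-write copy. */
        volatile long *i = mmap(NULL, sizeof(*i), PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (i == MAP_FAILED)
            return 1;
        *i = 0;

        for (int n = 0; n < NPROCS; n++) {
            if (fork() == 0) {
                /* Unsynchronized read-modify-write on shared memory:
                 * the children race and lose updates. */
                for (long k = 0; k < 1000000; k++)
                    (*i)++;
                _exit(0);
            }
        }
        while (wait(NULL) > 0)
            ;
        printf("aggregate i = %ld\n", *i);
        return 0;
    }

Run it and the aggregate will generally come up short of 4,000,000,
because the unsynchronized increments lose updates; add a lock and you
get the serialization penalty instead.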

    How does parallel processing ever do *anything* then?  Well, I'm
glad I asked for you (8-); that's really the question that needs to be
resolved:

    It does this by parallelizing things which are parallelizable.  It
may do so at various granularities (from high to low):

    o   Session granularity:  I run on one processor, and Phil runs on
        the other; by virtue of the fact that we each have our own
        processor, but are using one machine, it looks like we both have
        shared resources on separate machines (ASMP and/or things like
        NFS and lpd can make this seem to happen even if the processors
        aren't in the same case).

    o   Process granularity (parallel make and similar beasts), where
        one task is split into multiple processes, which can, but are
        not required to, run on separate processors simultaneously.

    o   Thread-of-execution granularity (POSIX threads/Solaris 2.0/
        AIX kernel preemption) where:

        Thread 0:   for( i=0; i<4; i++)
                        for( j=0; j<20; j++)
                            x[i][j] = foo;

        gets split into:

        Thread 0:   for( j=0; j<20; j++)
                        x[0][j] = foo;
        Thread 1:   for( j=0; j<20; j++)
                        x[1][j] = foo;
        Thread 2:   for( j=0; j<20; j++)
                        x[2][j] = foo;
        Thread 3:   for( j=0; j<20; j++)
                        x[3][j] = foo;

        and each thread is capable of being run by a separate processor
        (a threaded sketch of this split follows the list).  This relies
        moderately on compiler technology which is mostly there.  It
        falls into the category of loop-unrolling-style optimizations.

    o   Code-reordering granularity.  This requires extensive knowledge
        of the code (either by the programmer, in terms of hints to the
        compiler, or in terms of syntactically parallel languages), and
        extreme knowledge by the compiler of the underlying scheduling
        mechanisms (with hooks for manipulation).  This is sometimes
        called a shared-context, process-centric model.  (There are some
        *very* good papers on this on wuarchive.)
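
    To make the thread-granularity split concrete, here is a minimal
sketch using POSIX threads (the pthread API stands in for the thread
packages named above; 'foo' is just a placeholder value):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NCOLS    20

    int x[NTHREADS][NCOLS];
    int foo = 1;                /* stand-in for whatever foo is */

    /* Each thread fills exactly one row: the per-thread body from the
     * split shown above. */
    void *fill_row(void *arg)
    {
        int row = (int)(long)arg;
        for (int j = 0; j < NCOLS; j++)
            x[row][j] = foo;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, fill_row, (void *)(long)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        printf("x[3][19] = %d\n", x[3][19]);
        return 0;
    }

Whether the four threads actually land on four processors is up to the
kernel scheduler; the point is only that the rows are independent, so
they can.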

    I guess the whole point of where this is going is "can I add a whole
bunch of PC's (or their moral equivalent) together and get supercomputer
performance?"  The answer is yes, but only for parallel-dissoluble
problems.

    The original poster was trying to determine how much better a
parallel machine was than a non-parallel machine whose capability is
equivalent to a single one of the parallel machine's processors.  He was
unpleasantly surprised that for(;;) i++; didn't increment the variable n
times as fast for n processors (for reasons I hope are clear by now).

    The suggestion that he build a model of the parallel behaviour based
on the parallelizability of his intended application is not
unreasonable.  Most client/server databases (Oracle, for instance)
benefit greatly from session and process granularity on something like a
Sequent.  The added benefits of SMP vs. ASMP in this case are dynamic
load balancing and distribution of "interrupt" style processing within
the kernel (an ASMP machine, for instance, must direct all I/O through a
single processor as a serialization mechanism).

    I won't argue the (de)merits of MP computing here; even though there
are problems which can't be dissolved at any of the granularities
currently supported, there are other problems that lend themselves to
partial parallelization, or even total parallelization (fluidic
modelling, to name one).  Whichever case the final intended application
of the original poster falls into, predicting its behaviour certainly
can't be done without a model (i.e., a test program) that dissolves to
the same granularity as the final application.

    for(;;) i++; isn't rationally divisible (i.e., dividing it may in
fact force it to take longer) and thus models a non-parallelizable
application.

    while(fork()>0); for(;;i++); is rationally divisible at process
granularity, provided it isn't the same i in each child; one way to
aggregate the per-child counts is sketched below.
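
    A sketch of that rationally divisible version: each child counts in
its own private i for a fixed interval, then reports its count over a
pipe so the parent can arrive at the aggregate (the one-second alarm and
the four children are arbitrary choices for illustration):

    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NPROCS 4

    static volatile sig_atomic_t done;

    static void stop(int sig) { (void)sig; done = 1; }

    int main(void)
    {
        int fd[2];
        if (pipe(fd) < 0)
            return 1;

        for (int n = 0; n < NPROCS; n++) {
            if (fork() == 0) {
                /* Each child increments its own private copy of i:
                 * no sharing, no serialization. */
                long i = 0;
                signal(SIGALRM, stop);
                alarm(1);                       /* count for one second */
                while (!done)
                    i++;
                write(fd[1], &i, sizeof(i));    /* report the count */
                _exit(0);
            }
        }

        long total = 0, i;
        for (int n = 0; n < NPROCS; n++) {
            read(fd[0], &i, sizeof(i));         /* aggregate */
            total += i;
        }
        while (wait(NULL) > 0)
            ;
        printf("aggregate count: %ld\n", total);
        return 0;
    }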

    If the intended application does not fit either category, a new
model is needed; if it *does* fit either category, then he's probably
paying a premium for resource sharing between what are in effect
multiple separate machines in the same box.  Thus either the test model
was broken or the expectations of the user were.

                                        Terry Lambert
                                        terry_lambert@gateway.novell.com
                                        terry@icarus.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.
--
-------------------------------------------------------------------------------
terry@icarus.weber.edu
"I have an 8 user poetic license" - me
-------------------------------------------------------------------------------