- .\" indxbib bib
- .\" refer -e -p bib a | pic | tbl out1 | /usr/5bin/eqn -Ti300 | \
- .\" /usr/5lib/troff -Ti300 -ms - | dimp -t | qpr -q im217-raw
- .\"
- .so /usr/lib/tmac/tmac.srefs
- .EQ
- delim $$
- .EN
- .\"gsize 24
- .RP
- .TL
- Excerpts from
- .br
- ``Workstations, Networking,
- .br
- Distributed Graphics,
- .br
- and
- .br
- Parallel Processing''
- .AU
- Michael John Muuss
- .AI
- Leader, Advanced Computer Systems Team
- Ballistic Research Laboratory
- Aberdeen Proving Ground
- Maryland 21005-5066 USA
- .AB
- .LP
- The process of design is iterative in nature;
- designs are formulated, analyzed, and improved, until the goals are met.
- Modern graphics workstations provide a powerful platform for
- performing detailed design and limited analysis of solid model designs,
- with faster computers accessed via network links for full resolution analysis.
- A broad spectrum of analysis tools exist, and most express their
- output in graphical form.
- Their three main styles of graphics display will be examined, along
- with a look at the underlying software mechanisms to support them.
- .LP
- Several enabling technologies
- are required for analysis tools to run on one computer,
- with graphical display on another computer, including the
- network transparent framebuffer capability.
- The importance of portable external data representations will be
- reviewed, and several specific
- external data representations will be examined in significant detail.
- By carefully dividing application software into appropriate tools
- and connecting them with UNIX pipes, a measure of parallel processing
- can be achieved within one system.
- In a tool oriented environment
- with machine independent data formats,
- network distributed computation can be accomplished.
- .LP
- The next step is to
- make a single tool execute faster using parallel processing.
- Many analysis codes are implemented using the ray-tracing paradigm
- which is ideal for execution in parallel on
- tightly coupled shared-memory multiprocessors and loosely
- coupled ensembles of computers.
- Both are exploited, using different
- mechanisms.
- The strategies used for operating
- on shared-memory multiprocessors such as the Denelcor HEP, Alliant FX/8, and
- Cray X-MP will be presented, along with measured performance data.
- .LP
- The strategies used for dividing the work among network connected
- loosely coupled processors (each of which may themselves be a parallel
- processor)
- are presented, including details
- of the dispatching algorithm, and the design of the distribution protocol.
- The performance issues of this type of parallel processing will be presented,
- including a
- set of measured speeds on a variety of hardware.
- .AE
- .RT
- .\" Make troff use nroff style bracketed references
- .ds [. " [
- .ds .] ]
- .NH 1
- (Early Chapters Omitted)
- .NH 1
- SHARED-MEMORY PARALLEL PROCESSING
- .PP
- In the preceding section, we have seen how
- different processors
- can be harnessed to achieve a single goal.
- The discussion so far has focused on using multiple processors
- (a) within a single, multi-CPU system, through the use of UNIX pipes,
- and (b) by distributing different tools in a pipeline to different
- machines on the network.
- This section extends the investigation into parallel processing
- one level further, to harness the power of multiple processor
- machines to make a single application run faster.
- For the purposes of this discussion, the application to be
- parallelized will be a ray-tracing program, but the techniques
- developed here are quite general.
- .NH 2
- The Need for Speed
- .PP
- Images created using ray-tracing have a reputation for consuming large
- quantities of computer time. For complex models,
- 10 to 20 hours of processor time to render a single frame
- on a DEC VAX-11/780 class machine is not uncommon.
- Using the ray-tracing paradigm for engineering analysis
- .[
- Muuss Understanding Solid Models
- .]
- often requires many times more processing than rendering
- a view of the model. Examples of such engineering analyses
- include the predictive calculation of radar cross-sections,
- heat flow, and bi-static laser reflectivity. For models of
- real-world geometry, running these analyses approaches
- the limits of practical execution times, even with modern supercomputers.
- There are three main strategies that are being employed to
- attempt to decrease the amount of elapsed time it takes to
- ray-trace a particular scene:
- .IP 1)
- Advances in algorithms for ray-tracing.
- Newer techniques in partitioning space
- .[
- Kaplan Space-Tracing
- .]
- and in taking advantage of ray-to-ray coherence
- .[
- Arvo Ray Classification
- .]
- promise to continue to yield algorithms that do fewer and fewer
- ray/object intersections which do not contribute to the final results.
- Significant work remains to be done in this area, and an order of
- magnitude performance gain remains to be realized. However, there
- is a limit to the gains that can be made in this area.
- .IP 2)
- Acquiring faster processors. A trivial method for decreasing the elapsed
- time to run a program is to purchase a faster computer. However, even
- the fastest general-purpose computers such as the Cray X-MP and Cray-2
- do not execute fast enough to permit practical analysis of all
- real-world models in appropriate detail. Furthermore, the speed of light
- provides an upper bound on the fastest computer that can be built out of
- modern integrated circuits; this is already a significant
- factor in the Cray X-MP and Cray-2 processors, which operate with 8.5 ns
- and 4.3 ns clock periods respectively
- (see the brief calculation following this list).
- .IP 3)
- Using multiple processors to solve a single problem. By engaging the
- resources of multiple processors to work on a single problem, the
- speed-of-light limit can be circumvented. However, the price is that
- explicit attention must be paid to the distribution of data to the
- various processors, synchronization of the computations, and collection
- of the results.
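- .LP
- To make the speed-of-light bound mentioned in item 2) concrete
- (a rough back-of-the-envelope figure, not a measurement):
- in a single 8.5 ns clock period an electrical signal can propagate at most
- .EQ
- c t ~ approx ~ ( 3 times 10 sup 8 ~ roman "m/s" ) ( 8.5 times 10 sup {-9} ~ roman "s" ) ~ approx ~ 2.6 ~ roman "m" ,
- .EN
- and only about 1.3 m in a 4.3 ns period,
- distances comparable to the physical dimensions of the machines themselves.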
- .PP
- Parallel processing is still a relatively young art, and presently there
- is only limited support available for
- the automatic parallelization of existing
- code, with newer vendors like Alliant leading the crowd.
- For now, there are few general techniques
- for taking programs intended
- for serial operation on a single processor, and automatically adapting
- them for operation on multiple processors.
- .[
- Ohr Minisupercomputers Speed
- .]
- The \fBWorm\fR program developed at Xerox PARC
- .[
- Shoch Worm Distributed Computation
- .]
- is one of the earliest known network image-rendering applications.
- More recently at Xerox PARC, Frank Crow has attempted to distribute
- the rendering of a single image across multiple processors,
- .[
- Crow Distributed Execution Work in Progress
- .]
- but discovered that communication overhead and synchronization problems
- limited parallelism to about 30% of the available processing power.
- A good summary of work to date has been collected by Peterson.
- .[
- Peterson Distributed Computation
- .]
- .PP
- Ray-tracing analysis of a model has the very nice property that
- the computations for each ray/model intersection are entirely
- independent of other ray/model intersection calculations.
- Therefore, it is easy to see how the calculations for each ray
- can be performed by separate, independent processors. The
- underlying assumption is that each processor has read-only access
- to the entire model database.
- While it would be possible to partition the ray-tracing algorithm
- in such a way as to require only a portion of the model database
- to be resident in each processor, this would significantly increase
- the complexity of the implementation as well as the
- amount of synchronization and control traffic needed.
- Such a partitioning has therefore
- not yet been seriously attempted.
- .PP
- It is the purpose of the research reported in the rest of this paper to
- explore the performance limits of parallel operation of ray-tracing
- algorithms where available processor memory is not a limitation.
- While it is not expected that this research will result
- in a general purpose technique for distributing arbitrary programs
- across multiple processors, the issues of the control and distribution of work
- and providing reliable results in a potentially unreliable system
- are quite general. The techniques used here are likely to be
- applicable to a large set of other applications.
- .NH 2
- Raytracing Background
- .PP
- The origins of modern ray-tracing come from work at MAGI
- under contract to BRL, initiated in the early 1960s.
- The initial results were reported by MAGI
- .[
- geometric description technique MAGI
- .]
- in 1967.
- Extensions to the early developments were undertaken by a
- DoD Joint Technical Coordinating Group effort, resulting in
- publications in 1970
- .[
- MAGIC User Manual
- .]
- and 1971.
- .[
- MAGIC Analyst Manual
- .]
- A detailed presentation of the
- fundamental analysis and implementation of the ray-tracing algorithm
- can be found in these two documents.
- Also see
- .[
- Appel shading machine renderings solids
- .]
- .PP
- More recently, interest in ray-tracing developed in the academic
- community, with Kay's
- .[
- Kay Ray Tracing 1979
- .]
- thesis in 1979
- being a notable early work.
- One of the central papers in the ray-tracing literature is
- the work of Whitted.
- .[
- Whitted Improved Illumination Model
- .]
- Model sampling techniques can be improved to provide substantially
- more realistic images by using the ``Distributed Ray Tracing'' strategy.
- .[
- Cook Carpenter Distributed Ray Tracing
- .]
- For an excellent, concise discussion of ray-tracing,
- consult pages 363-381 of Rogers.
- .[
- Rogers Procedural Elements
- .]
- .PP
- There are several implementation strategies for interrogating
- the model by computing ray/geometry intersections.
- The traditional approach has been batch-oriented,
- with the user defining a set of ``viewing angles'',
- turning loose a big batch job to compute all the ray intersections,
- and then post-processing all the ray data into some meaningful form.
- However, the major drawback of this approach is that the application
- has no dynamic control over ray paths, making another batch run
- necessary for each level of reflection, etc.
- .PP
- In order to be successful, applications need: (1) dynamic control of ray
- paths, to naturally implement reflection, refraction, and fragmentation into
- multiple subsidiary rays, and (2) the ability to fire rays in arbitrary
- directions from arbitrary points.
- Nearly all non-batch ray-tracing implementations have a specific, closely
- coupled application (typically a model of illumination), which
- allows efficient and effective control of the ray paths.
- However, the most flexible approach is to implement the ray-tracing
- capability as a general-purpose library, to make the functionality
- available to any application as needed, and
- this is the approach taken in the BRL CAD Package.
- .[
- Muuss CAD Package Release 1.21
- .]
- The ray-tracing library is called \fBlibrt\fR, while the ray-tracing
- application of interest here (an optical spectrum lighting model)
- is called \fBrt\fR.
- .NH 2
- The Structure of librt
- .PP
- In order to give all applications dynamic control over
- the ray paths, and to allow the rays to be fired in arbitrary directions
- from arbitrary points, BRL has implemented its third generation
- ray-tracing capability as a set of library routines.
- \fBlibrt\fR exists to allow application programs to
- intersect rays with model geometry. There are four parts to the
- interface: three preparation routines and the actual ray-tracing routine.
- The first routine to be called is
- rt_dirbuild(), which opens the database file and builds the
- in-core database table of contents.
- The second routine to be called is rt_gettree(), which
- adds a database sub-tree to the active model space.
- rt_gettree() can be called multiple times
- to load different parts of the database
- into the active model space.
- The third routine is rt_prep(), which
- computes the space partitioning data structures and does other
- initialization chores.
- Calling this routine is optional,
- as it will be called by rt_shootray() if needed.
- rt_prep() is provided as a separate routine to
- allow independent timing of the preparation and ray-tracing phases of
- applications.
- .PP
- To compute the intersection of a ray with the geometry in the active
- model space, the application must call rt_shootray() once for each
- ray. Ray-path selection for perspective, reflection, refraction, etc.,
- is entirely determined by the application program. The only parameter
- to rt_shootray() is a \fBlibrt\fR ``application'' structure, which
- contains five major elements:
- the vector a_ray.r_pt, the starting point of the ray to be fired;
- the vector a_ray.r_dir, the unit-length direction vector of the ray;
- the pointer *a_hit(), the address of an application-provided
- routine to call when the ray intersects the model geometry;
- the pointer *a_miss(), the address of an application-provided routine to
- call when the ray does not hit any geometry;
- and the flag a_onehit, which is
- set non-zero to stop ray-tracing as soon as the ray has intersected at
- least one piece of geometry (useful for lighting models);
- plus various
- locations for each application to store state (recursion level, colors,
- etc). Note that the integer returned from the application-provided
- a_hit()/a_miss() routine is the formal return of the function
- rt_shootray(). The rt_shootray() function is prepared for full recursion
- so that the a_hit()/a_miss() routines can themselves fire additional
- rays by calling rt_shootray() recursively before deciding their own
- return value.
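- .PP
- The major elements just described may be summarized in the following
- sketch of the ``application'' structure.
- This is a sketch only; the actual declaration in <brlcad/raytrace.h>
- uses the library's own vector and ray types and contains
- additional members for per-application state.
- .sp .5
- .nf
- .ne 2i
- struct application {
-     struct {
-         double  r_pt[3];    /* a_ray.r_pt: starting point of the ray */
-         double  r_dir[3];   /* a_ray.r_dir: unit-length direction vector */
-     } a_ray;
-     int     (*a_hit)();     /* called when the ray intersects the model geometry */
-     int     (*a_miss)();    /* called when the ray misses all geometry */
-     int     a_onehit;       /* non-zero: stop after the first intersection */
-     /* ... per-application state: recursion level, colors, etc. ... */
- };
- .fi
- .sp .5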
- .PP
- In addition, the function rt_shootray() is serially and concurrently
- reentrant, using only registers, local variables allocated on the stack, and
- dynamic memory allocated with rt_malloc().
- The rt_malloc() function serializes calls to \fBmalloc\fR(3).
- By having the ray-tracing library fully prepared to run
- in parallel with other instances of itself in the same address space,
- applications may take full advantage of parallel hardware capabilities,
- where such capabilities exist.
- .NH 2
- A Sample Ray-Tracing Program
- .PP
- A simple application program that fires one ray at a model and prints
- the result is included below, to demonstrate the simplicity of the
- interface to \fBlibrt\fR.
- .LP
- .sp .5
- .nf
- .ne 2i
- #include <brlcad/raytrace.h>
- 
- struct application ap;
- extern int hit_geom(), miss_geom();
- 
- main() {
-     rt_dirbuild("model.g");     /* read the database table of contents */
-     rt_gettree("car");          /* add this sub-tree to the active model space */
-     rt_prep();                  /* build the space partitioning structures */
-     ap.a_ray.r_pt[0] = 100;     /* starting point of the ray */
-     ap.a_ray.r_pt[1] = 0;
-     ap.a_ray.r_pt[2] = 0;
-     ap.a_ray.r_dir[0] = -1;     /* unit-length direction vector */
-     ap.a_ray.r_dir[1] = 0;
-     ap.a_ray.r_dir[2] = 0;
-     ap.a_hit = hit_geom;        /* called on a hit */
-     ap.a_miss = miss_geom;      /* called on a miss */
-     ap.a_onehit = 1;            /* stop at the first intersection */
-     rt_shootray( &ap );
- }
- 
- int
- hit_geom(app, part)
- struct application *app;
- struct partition *part;
- {
-     printf("Hit %s\n", part->pt_forw->pt_regionp->reg_name);
-     return(1);
- }
- 
- int
- miss_geom()
- {
-     printf("Missed\n");
-     return(0);
- }
- .fi
- .NH 2
- Normal Operation: Serial Execution
- .PP
- When running the \fBrt\fR program on a serial processor,
- the code of interest is the top of the subroutine hierarchy.
- The function
- main()
- first calls
- get_args()
- to parse any command line options, then calls
- rt_dirbuild()
- to acquaint \fBlibrt\fR with the model database, and
- view_init()
- to initialize the application
- (in this case a lighting model, which may call
- mlib_init()
- to initialize the material-property library). Finally,
- rt_gettree()
- is called repeatedly to load the model treetops.
- For each frame to be produced,
- the viewing parameters are processed, and
- do_frame()
- is called.
- .PP
- Within
- do_frame(),
- per-frame initialization is handled by calling
- rt_prep(),
- mlib_setup(),
- grid_setup(),
- and
- view_2init().
- Then,
- do_run()
- is called with the linear pixel indices of the start and end locations in
- the image; typically these values are zero and width*length-1,
- except for the ensemble computer case.
- In the non-parallel cases,
- the do_run()
- routine initializes the global variables
- cur_pixel and last_pixel, and calls
- worker().
- At the end of the frame,
- view_end()
- is called to handle any final output, and print some statistics.
- .PP
- The worker() routine
- obtains the index of the next pixel that needs to be computed by
- incrementing cur_pixel, and calls
- rt_shootray()
- to interrogate the model geometry.
- view_pixel()
- is called to output the results for that pixel.
- worker()
- loops, computing one pixel at a time, until
- \fIcur_pixel > last_pixel\fR,
- after which it returns.
- .PP
- When
- rt_shootray()
- hits some geometry, it calls the a_hit() routine listed in the application
- structure to determine the final color of the pixel.
- In this case, colorview() is called. colorview() uses view_shade()
- to do the actual computation.
- Depending on the properties of the material hit and the stack of shaders
- that are being used, various material-specific renderers may be called,
- followed by a call to
- rr_render()
- if reflection or refraction is needed. Any of these routines may spawn
- multiple rays, and/or recurse on
- colorview().
- .NH 2
- Parallel Operation on Shared-Memory Machines
- .PP
- By capitalizing on the serial and concurrent
- reentrancy of the \fBlibrt\fR routines, it is very easy to take advantage
- of shared memory machines where it is possible to initiate multiple
- ``streams of execution'' or ``threads''
- within the address space of a single process.
- To ensure that global variables are manipulated by only
- one instruction stream at a time,
- all such shared modifications are enclosed in critical sections.
- For each type of processor, it is necessary to implement the routines
- RES_ACQUIRE()
- and
- RES_RELEASE()
- to provide system-wide semaphore operations.
- When a processor has acquired a resource, any other processors needing that
- same resource will wait until it is released, at which time exactly
- one of the waiting processors will acquire it.
- .PP
- In order to minimize contention between processors
- over the critical sections of code, all critical sections are kept
- as short as possible: typically only a few lines of code.
- Furthermore, there are different semaphores for each type of resource
- accessed in critical sections.
- .I res_syscall
- is used to interlock all UNIX
- system calls and some library routines,
- such as
- \fBwrite\fR(2),
- \fBmalloc\fR(3),
- \fBprintf\fR(3),
- etc.
- .I res_worker
- is used by the function
- worker()
- to serialize access to the variable
- cur_pixel,
- which contains the index of the next pixel to be computed.
- .I res_results
- is used by the function
- view_pixel()
- to serialize access to the result buffer. This is necessary because
- few processors have hardware multi-processor interlocking on
- byte operations within the same word.
- .I res_model
- is used by the \fBlibspl\fR spline library routines to serialize operations
- which cause the model to be further refined during the raytracing process,
- so that data structures remain consistent.
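- .PP
- As an illustration of how small these critical sections are, the
- serialization of \fBmalloc\fR(3) mentioned earlier can be sketched
- as follows.
- This is a sketch rather than the actual \fBlibrt\fR source;
- the semaphore name rt_g.res_syscall follows the naming pattern of
- rt_g.res_worker used in the worker() example below, and error
- handling has been omitted.
- .sp .5
- .nf
- .ne 2i
- extern char *malloc();
- 
- char *
- rt_malloc(cnt)
- unsigned int cnt;
- {
-     register char *ptr;
- 
-     RES_ACQUIRE( &rt_g.res_syscall );   /* interlock the C library */
-     ptr = malloc(cnt);
-     RES_RELEASE( &rt_g.res_syscall );
-     return(ptr);
- }
- .fi
- .sp .5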
- .PP
- Application of the usual client-server model of computing would
- suggest that one stream of execution would be dedicated to dispatching the
- next task, while the rest of the streams of execution would be used for
- ray-tracing computations. However, in this case, the dispatching operation
- is trivial and a ``self-dispatching'' algorithm is used,
- with a critical section
- used to protect the shared variable
- cur_pixel.
- The real purpose of the function
- do_run()
- is to perform whatever machine-specific operation is required to
- initiate
- .I npsw
- streams of execution within the address space of the \fBrt\fR program, and
- then to have each stream call the function
- worker(),
- each with appropriate local stack space.
- .PP
- Each worker() function will loop until no more pixels remain,
- taking the next available pixel index.
- For each pass through the loop, RES_ACQUIRE(res_worker) is
- used to acquire the semaphore, the index of the next
- pixel to be computed, \fIcur_pixel\fR,
- is read and incremented, and then the semaphore is released,
- i.e.,
- .nf
- .sp .5
- worker() {
-     while(1) {
-         RES_ACQUIRE( &rt_g.res_worker );
-         my_index = cur_pixel++;
-         RES_RELEASE( &rt_g.res_worker );
-         if( my_index > last_pixel )
-             break;
-         a.a_x = my_index%width;
-         a.a_y = my_index/width;
-         ...compute ray parameters...
-         rt_shootray( &a );
-     }
- }
- .fi
- .PP
- On the Denelcor HEP H-1000 each word of memory has a full/empty tag bit
- in addition to 64 data bits.
- RES_ACQUIRE is implemented using the
- Daread()
- primitive, which uses the hardware capability to wait until the
- semaphore word is full, then read it, and mark it as empty.
- RES_RELEASE is implemented using the
- Daset()
- primitive, which marks the word as full.
- do_run()
- starts additional streams of execution using the
- Dcreate(worker)
- primitive, which creates another stream which immediately calls the
- worker()
- function.
- .PP
- On the Alliant FX/8, RES_ACQUIRE is implemented using the hardware
- instruction test-and-set (TAS) which tests a location for being
- zero.
- As an atomic operation,
- if the location is zero, it sets it non-zero and
- sets the condition codes appropriately. RES_ACQUIRE embeds this
- test-and-set instruction in a polling loop to wait for acquisition of
- the resource.
- RES_RELEASE just zeros the
- semaphore word. Parallel execution is achieved by using the
- hardware capability to spread a loop across multiple processors,
- so a simple loop from 0 to 7 which calls
- worker()
- is executed in hardware concurrent mode. Each concurrent instance
- of worker() is given a separate stack area in the ``cactus stack''.
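- .PP
- The test-and-set style of semaphore described above can be sketched
- in C as shown below.
- This is a sketch only, not the Alliant implementation
- (which issues the TAS machine instruction directly);
- tas() stands in for a hypothetical atomic test-and-set primitive
- that sets the lock word non-zero and returns its previous value.
- .sp .5
- .nf
- .ne 2i
- RES_ACQUIRE(p)
- register int *p;
- {
-     while( tas(p) != 0 )    /* spin until the word was zero when we claimed it */
-         continue;
- }
- 
- RES_RELEASE(p)
- register int *p;
- {
-     *p = 0;                 /* releasing is simply zeroing the semaphore word */
- }
- .fi
- .sp .5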
- .PP
- On the Cray X-MP and Cray-2, the Cray multi-tasking library is used.
- RES_ACQUIRE maps into LOCKON, and RES_RELEASE maps into LOCKOFF,
- while
- do_run()
- just calls
- TSKSTART(worker)
- to obtain extra workers.
- .NH 2
- Performance Measurements
- .PP
- An important part of the BRL CAD Package is a set of five benchmark model
- databases and associated viewing parameters, which permit the relative
- performance of different computers and configurations to be measured using
- a significant production program as the basis of comparison. For the
- purposes of this paper, just the ``Moss'' database will be used for
- comparison. Since this benchmark generates pixels the fastest, it will
- place the greatest demands on any parallel processing scheme. The
- benchmark image is computed at 512x512 resolution.
- .PP
- The relative performance figures for running \fBrt\fR in the parallel mode
- with Release 1.20 of the BRL CAD Package are presented below. The
- Alliant FX/8 machine was brl-vector.arpa, configured with 8
- Computational Elements (CEs), 6 68012 Interactive Processors (IPs), 32
- Mbytes of main memory, and was running Concentrix 2.0, a port of 4.2 BSD
- UNIX.
- The Cray X-MP/48 machine was brl-patton.arpa, serial number 213, with
- 4 processors, 8 Mwords of main memory, with a clock period
- of 8.5 ns, and UNICOS 2.0, a port of System V UNIX.
- Unfortunately, no comprehensive results are available for the
- Denelcor HEP, the only other parallel computer known to have run this code.
- .TS
- center box;
- c s s s s
- n n n n n.
- Parallel \fBrt\fR Speedup -vs- # of Processors
- _
- # Processors FX/8 (eff) X-MP/48 (eff)
- _
- 1 1.00 100% 1.00 100%
- 2 1.84 92.0% 1.99 99.5%
- 3 2.79 93.0% 2.96 98.7%
- 4 3.68 92.0% 3.86 96.5%
- 5 4.80 96.0%
- 6 5.70 95.0%
- 7 6.50 92.9%
- 8 7.46 93.3%
- .TE
- .PP
- The multiple-processor performance of \fBrt\fR increases nearly
- linearly for shared memory machines
- with small collections of processors.
- The slight boost in efficiency of the Alliant when the fifth processor is added
- comes from the fact that the first four processors share one cache
- memory, while the second four share a second cache memory.
- To date, \fBrt\fR holds the record for the best achieved speedup for parallel
- processing on both the Cray X-MP/48 and the Alliant.
- Measurements on the HEP, before
- it was dismantled, indicated that near-linear improvements continued through
- 128 streams of execution. This performance is due to the fact that
- the critical sections are very small, typically just a few lines of code,
- and that they account for an insignificant portion of the computation time.
- When \fBrt\fR is run in parallel and the number of processors is increased,
- the limit to overall performance will be determined
- by the total bandwidth of the shared memory, and by memory conflicts over
- popular regions of code and data.
- .NH 1
- DISTRIBUTED COMPUTATION ON LOOSELY-COUPLED ENSEMBLES OF PROCESSORS
- .PP
- The basic assumption of this design is that network bandwidth is
- modest, so that the number of bytes and packets of overhead should not exceed
- the number of bytes and packets of results.
- The natural implementation would be to provide
- a remote procedure call (RPC) interface to
- rt_shootray(), so that when additional subsidiary rays are needed,
- more processors could potentially be utilized. However, measurements
- of this approach on VAX, Gould, and Alliant computers indicate that
- the system-call and communications overhead is comparable to the
- processing time for one ray/model intersection calculation.
- This much overhead rules out the RPC-per-ray interface for practical
- implementations.
- On some tightly coupled
- ensemble computers, there might be little penalty for such an approach,
- but in general, some larger unit of work must be exchanged.
- .PP
- It was not the intention of the author to develop another protocol
- for remote file access, so the issue of distributing the model database
- to the \fBrtsrv\fR server machines is handled outside of the context of
- the \fBremrt\fR and \fBrtsrv\fR software.
- In decreasing order of preference, the
- methods for model database distribution that are currently used
- are Sun NFS, Berkeley \fBrdist\fR, Berkeley \fBrcp\fR,
- and ordinary DARPA \fBftp\fR.
- Note that the binary databases need to be converted to a portable
- format before they are transmitted across the network, because \fBrtsrv\fR
- runs on a wide variety of processor types.
- Except for the model databases
- and the executable code of the \fBrtsrv\fR server process itself,
- no file storage is used on any of the server machines.
- .NH 2
- Distribution of Work
- .PP
- The approach used in \fBremrt\fR involves a single dispatcher
- process, which communicates with an arbitrary number of server
- processes.
- Work is assigned in groups of scanlines.
- As each server finishes a scanline,
- the results are sent back to the dispatcher,
- where they are stored.
- Completed scanlines are removed from the list of scanlines to be done
- and from the list of scanlines currently assigned to that server.
- Different servers may be working on entirely different frames.
- Before a server is assigned scanlines from a new frame, it is sent
- a new set of options and viewpoint information.
- .PP
- The underlying communications layer used
- is the package (PKG) protocol, provided by the \fBlibpkg\fR library,
- so that all communications are known to be
- reliable, and communication disruptions are noticed.
- Whenever the
- dispatcher is notified by the \fBlibpkg\fR routines that contact with a server
- has been lost, all unfinished scanlines assigned to that server will be
- requeued at the head of the ``work to do'' queue, so that they will be
- assigned to the very next available server, allowing tardy scanlines to
- be finished quickly.
- .NH 2
- Distribution Protocol
- .PP
- When a server process \fBrtsrv\fR is started, the host name of the machine
- running the dispatcher process is given as a command line argument.
- The server process can be started from a command in the dispatcher
- \fBremrt\fR, which uses \fBsystem\fR(3)
- to run the \fBrsh\fR program, or directly via some other mechanism.
- This avoids the need to register the \fBrtsrv\fR program as a system network
- daemon and transfers issues of access control, permissions, and
- accounting onto other, more appropriate tools. Initially, the \fBrtsrv\fR
- server initiates a PKG connection to the dispatcher process and then
- enters a loop reading commands from the dispatcher. Some commands
- generate no response at all, some generate one response message, and
- some generate multiple response messages. However, note that the server
- does not expect to receive any additional messages from the dispatcher
- until after it has finished processing a request, so that requests do
- not have to be buffered in the server. While this simplifies the code,
- it has some performance implications, which are discussed later.
- .PP
- In the first stage, the message received must be of type MSG_START, with
- string parameters specifying the pathname of the model database and the
- names of the desired treetops. If all goes well, the server responds
- with a MSG_START message, otherwise diagnostics are returned as string
- parameters to a MSG_PRINT message and the server exits.
- .PP
- In the second
- stage, the message received must be of type MSG_OPTIONS or MSG_MATRIX.
- MSG_OPTIONS specifies the image size and shape, hypersampling, stereo
- viewing, perspective -vs- ortho view, and control of randomization
- effects (the ``benchmark'' flag), using the familiar UNIX command line
- option format. MSG_MATRIX contains the 16 ASCII floating point numbers
- for the 4x4 homogeneous transformation matrix which represents the
- desired view.
- .PP
- In the third stage, the server waits for messages of type MSG_LINES,
- which specify the starting and ending scanline to be processed.
- As each scanline is completed, it is immediately sent back to the
- dispatcher process to minimize the amount of computation that could
- be lost in case of server failure or communications outage.
- Each scanline is returned in a message of type MSG_PIXELS. The first
- two bytes of that message contain the scanline number in
- network order 16-bit binary.
- Following that is the 3*width bytes of RGB
- data that represents the scanline.
- When all the scanlines specified in the MSG_LINES command are processed,
- the server again waits for another message, either another MSG_LINES
- command or a MSG_OPTIONS/MSG_MATRIX command to specify a new view.
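- .PP
- The scanline encoding just described can be sketched as follows.
- This is illustrative only, not the \fBrtsrv\fR source;
- the function and variable names are invented for the example,
- and the actual transmission of the buffer as a MSG_PIXELS message
- through the \fBlibpkg\fR routines is omitted.
- .sp .5
- .nf
- .ne 2i
- #include <sys/types.h>
- #include <netinet/in.h>         /* for htons() */
- 
- int
- pack_scanline(buf, scanline, rgb, width)
- unsigned char *buf;             /* message buffer, at least 2+3*width bytes */
- int scanline;                   /* scanline number */
- unsigned char *rgb;             /* 3*width bytes of RGB results */
- int width;
- {
-     unsigned short line;
- 
-     line = htons( (unsigned short)scanline );       /* 16-bit, network order */
-     bcopy( (char *)&line, (char *)buf, 2 );         /* first two bytes */
-     bcopy( (char *)rgb, (char *)&buf[2], 3*width ); /* then the RGB data */
-     return( 2 + 3*width );                          /* total message length */
- }
- .fi
- .sp .5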
- .PP
- At any time, a MSG_RESTART message can be received by the server, which
- indicates that it should close all its files and
- immediately re-\fBexec\fR(2) itself, either to prepare for processing
- an entirely new model, or as an error recovery aid.
- A MSG_LOGLVL message can be received at any time, to enable and disable
- the issuing of MSG_PRINT output.
- A MSG_END message suggests that the server should commit suicide,
- courteously.
- .NH 2
- Dispatching Algorithm
- .PP
- The dispatching (scheduling) algorithm revolves around two main lists,
- the first being a list of currently connected servers and
- the second being a list of frames still to be done. For each unfinished
- frame, a list of scanlines remaining to be done is maintained.
- For each server, a list of the currently assigned scanlines is kept.
- Whenever a server returns a scanline, it is removed from the list of
- scanlines assigned to that server, stored in the output image,
- and also in the optional attached framebuffer.
- (It can be quite entertaining to watch the scanlines racing up the screen,
- especially when using processors of significantly different speeds).
- If the arrival of this scanline completes a frame, then the
- frame is written to disk on the dispatcher machine, timing data is computed,
- and that frame
- is removed from the list of work to be done.
- .PP
- When a server finishes the last scanline of its assignment and more
- work remains to be done, the list of unfinished frames is searched and
- the next available increment of work is assigned. Work is assigned in
- blocks of consecutive scanlines, up to a per-server maximum assignment size.
- The block of scanlines is recorded as the server's new assignment and
- is removed from the list of work to be done.
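- .PP
- The bookkeeping described above can be sketched with the following
- declarations.
- These are illustrative only; the structure and field names in the
- actual \fBremrt\fR source differ.
- .sp .5
- .nf
- .ne 2i
- struct block {
-     int             b_start;        /* first scanline of this block */
-     int             b_end;          /* last scanline of this block */
-     struct block    *b_forw;        /* next block in the list */
- };
- 
- struct frame {
-     int             fr_number;      /* which frame this is */
-     struct block    *fr_todo;       /* scanlines remaining to be done */
-     struct frame    *fr_forw;       /* next unfinished frame */
- };
- 
- struct server {
-     struct pkg_conn *sv_pc;         /* PKG connection to this rtsrv */
-     struct frame    *sv_frame;      /* frame currently being worked on */
-     struct block    *sv_assigned;   /* scanlines currently assigned */
-     int             sv_maxassign;   /* per-server maximum block size */
- };
- .fi
- .sp .5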
- .NH 2
- Reliability Issues
- .PP
- If the \fBlibpkg\fR communications layer loses contact with a server machine,
- or if \fBremrt\fR is commanded to drop a server, then the scanlines remaining
- in the assignment are requeued at the head of the list of scanlines remaining
- for that frame. They are placed at the head of the list so that the first
- available server will finish the tardy work, even if it had gone ahead to
- work on a subsequent frame.
- .PP
- Presently, adding and dropping server machines is a manual (or script
- driven) operation. It would be desirable to develop a separate
- machine-independent network mechanism
- that \fBremrt\fR could use to inquire about the current loading and availability
- of server machines, but this has not been done.
- Such a mechanism would permit periodic status requests to be made,
- and automatic reacquisition of eligible server machines
- to be attempted.
- Peterson's Distrib
- .[
- Peterson Distributed Computation
- .]
- system incorporates this as a built-in part of the distributed computing
- framework, but it seems that using an independent transaction-based
- facility such as Pistritto's Host Monitoring Protocol (HMP) facility
- .[
- Natalie Muuss BRL VAX UNIX Manual
- .]
- would be a more general solution.
- .PP
- If the dispatcher fails, all frames that have not been completed
- are lost; on restart,
- execution resumes at the beginning of the first uncompleted frame.
- By carefully choosing a machine that has excellent reliability to
- run the dispatcher on, the issue of dispatcher failure can be largely
- avoided. However, typically no more than two frames will be lost,
- minimizing the impact. For frames that take extremely long times to
- compute, it would be reasonable to extend the dispatcher to snapshot the work
- queues and partially assembled frames in a disk file, to permit operation
- to resume from the last ``checkpoint''.
- .NH 2
- Distributed remrt Performance
- .PP
- Ten identical Sun-3/50 systems were used
- to test the performance of \fBremrt\fR.
- All had 68881 floating point units and 4 Mbytes of memory,
- and all were in normal timesharing mode, unused except for
- running the tests and the slight overhead
- imposed by /etc/update, \fBrwhod\fR, etc.
- To provide a baseline performance figure for comparison, the
- benchmark image was computed in the normal way using \fBrt\fR, to
- avoid any overhead which might be introduced by \fBremrt\fR.
- The elapsed time to execute the ray-tracing portion of the benchmark
- was 2639 seconds; the preparation phase was not included, but amounted
- to only a few seconds.
- .TS
- center box;
- c s s s s s s
- l c s | c s | c c
- l c c | c c | c c
- l n n | n n | n n.
- \fBremrt\fR Speedup -vs- # of Processors
- _
- Ratios Elapsed Seconds
- # CPUs Theory Sun-3/50 Theory Sun-3/50 Total Speedup Efficiency
- _
- 1 1.0000 1.0072 2639.0 2658 0.993 99.3%
- 2 0.5000 0.5119 1319.5 1351 1.953 97.7%
- 3 0.3333 0.3357 879.6 886 2.979 99.3%
- 4 0.2500 0.2524 659.7 666 3.949 98.7%
- 5 0.2000 0.2027 527.8 535 4.916 98.3%
- _
- 6 0.1666 0.1686 429.8 445 5.910 98.5%
- 7 0.1429 0.1470 377.0 388 6.778 96.8%
- 8 0.1250 0.1266 329.9 334 7.874 98.4%
- 9 0.1111 0.1133 293.2 299 8.796 97.7%
- 10 0.1000 0.1019 263.9 269 9.777 97.8%
- .TE
- .PP
- The ``speedup'' figure of 0.993 for 1 CPU shows the loss of performance
- of 0.7% introduced by the overhead of
- the \fBremrt\fR to \fBrtsrv\fR communications,
- versus the non-distributed \fBrt\fR performance figure. The primary result of note
- is that the speedup of the \fBremrt\fR network distributed application is
- very close to the theoretical maximum speedup, with a total efficiency
- of 97.8% for the ten Sun case!
- The very slight loss of performance noticed (2.23%)
- is due mostly to ``new assignment latency'',
- discussed further below. Even so, it is worth noting that the speedup
- achieved by adding processors with \fBremrt\fR was even better than the
- performance achieved by adding processors in parallel mode with \fBrt\fR.
- This effect is due mostly to the lack of memory and semaphore contention
- between the \fBremrt\fR machines.
- .PP
- Unfortunately, time did not permit configuring and testing
- multiple Alliants running \fBrtsrv\fR in full parallel mode,
- although such operation is supported by \fBrtsrv\fR.
- .PP
- When \fBremrt\fR is actually being used for producing images, many different
- types of processors can be used together. The aggregate performance of
- all the available machines on a campus network is truly awesome,
- especially when a Cray or two is included! Even in this
- case, the network bandwidth required does not exceed the capacity of an
- Ethernet (yet). The bandwidth requirements are sufficiently small that
- it is practical to run many \fBrtsrv\fR processes distributed over the
- ARPANET/MILNET. On one such occasion in early 1986, 13 Gould PN9080
- machines were used all over the east coast to finish some images for a
- publication deadline.
- .NH 2
- Performance Issues
- .PP
- The policy of making work assignments in terms of multiple adjacent
- scanlines
- reduces
- the processing requirements of the dispatcher and also improves the
- efficiency of the servers.
- As a server finishes a scanline, it can give the scanline to
- the local operating system to send to the dispatcher machine,
- while the server
- continues with the computation, allowing the transmission to be
- overlapped with more computation. When gateways and wide-area networks
- are involved (with their accompanying increase in latency and packet loss),
- this is an important consideration. In the current
- implementation, assignments are always blocks of three scanlines
- because there is no general way for the \fBrtsrv\fR process to know what kind
- of machine it is running on and how fast it is likely to go. Clearly,
- it would be worthwhile to assign larger blocks of scanlines to the
- faster processors so as to minimize idle time and control traffic overhead.
- Seemingly the best way to determine this would be to measure the rate of
- scanline completion and dynamically adjust the allocation size.
- This is not currently implemented.
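- .PP
- Such an adaptive policy might be sketched as follows; this is a sketch
- of the suggested (and, as noted, unimplemented) approach, and all names
- and the choice of target time per assignment are illustrative only.
- The static default of three scanlines is retained as a fallback.
- .sp .5
- .nf
- .ne 2i
- #define TARGET_SECONDS  60      /* desired duration of one assignment */
- 
- int
- assignment_size(lines_done, elapsed_sec, max_assign)
- int lines_done;         /* scanlines returned by this server so far */
- double elapsed_sec;     /* seconds since its first assignment */
- int max_assign;         /* per-server maximum block size */
- {
-     double rate;
-     int size;
- 
-     if( lines_done <= 0 || elapsed_sec <= 0.0 )
-         return( 3 );                    /* fall back to the static default */
-     rate = lines_done / elapsed_sec;    /* measured scanlines per second */
-     size = (int)(rate * TARGET_SECONDS);
-     if( size < 1 )  size = 1;
-     if( size > max_assign )  size = max_assign;
-     return( size );
- }
- .fi
- .sp .5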
- .PP
- By increasing the scanline block assignment size for the faster
- processors, the amount of time the server spends waiting for a new
- assignment (termed ``new assignment latency'') will be diminished, but
- not eliminated. Because the current design assumes that the server will
- not receive another request until the previous request has been fully
- processed, no easy solution exists. Extending the server implementation
- to buffer at least one additional request would permit this limitation
- to be overcome, and the dispatcher would then have the option of sending
- a second assignment before the first one had completed, to always keep
- the server ``pipeline'' full. For the case of very large numbers of
- servers, this pipelining will be important to keep delays in the
- dispatcher from affecting performance. In the case of very fast servers,
- pipelining will be important in achieving maximum server utilization, by
- overcoming network and dispatcher delays.
- .PP
- To obtain an advantage from the pipeline effect of the multiple scanline
- work assignments, it is important that the network implementations
- in both the servers and the dispatcher have adequate buffering to
- hold an entire scanline (typically 3K bytes). For the dispatcher,
- it is a good idea to increase the default TCP receive space (and thus
- the receive window size) from 4K bytes to 16K bytes. For the server
- machines, it is a good idea to increase the default TCP transmit space
- from 4K bytes to 16K bytes. This can be accomplished by modifying
- the file /sys/netinet/tcp_usrreq.c to read:
- .sp .5
- .nf
- int tcp_sendspace = 1024*16;
- int tcp_recvspace = 1024*16;
- .fi
- .sp .5
- or to make suitable modifications to the binary image of the kernel
- using \fIadb\fR(1):
- .sp .5
- .nf
- adb -w -k /vmunix
- tcp_sendspace?W 0x4000
- tcp_recvspace?W 0x4000
- .fi
- .sp .5
- .PP
- The dispatcher process must maintain an active network connection to
- each of the server machines. In all systems there is some limit to the
- number of open files that a single process may use (symbol NOFILE); in
- 4.3 BSD UNIX, the limit is 64 open files. For the current implementation,
- this places an upper bound on the number of servers that can be used.
- As many campus networks have more than 64 machines available at night,
- it would be nice if this limit could be eased. One approach is
- to increase the limit on the dispatcher machine. Another approach is to
- implement a special ``relay server'' to act as a fan-in/fan-out
- mechanism, although the additional latency could get to be an issue.
- A third approach is to partition the problem at a higher level.
- For example, having the east campus do the beginning of a movie,
- and the west campus do the end would reduce the open file problem.
- Additionally, if gateways are involved, partitioning the problem may be kinder
- to the campus network.
- .NH 2
- Conclusions
- .PP
- Parallel computing is good.
- .PP
- This paper has shown how it is possible to implement good graphics
- interfaces within the confines of a single uniprocessor machine.
- With the adoption of
- a ``software tools'' approach when providing data handling capabilities,
- it was shown how to transparently take advantage of multi-processor
- machines, and thus realize a speed advantage.
- Furthermore, the careful selection of file formats permitted
- realizing a further speed advantage in the accomplishment of a single task
- by utilizing multiple systems located across a network.
- .PP
- Carrying the theme of increased speed further,
- multi-processor systems were examined as a vehicle for making single
- image generation tools operate faster.
- An important result was to note that
- when operation in a shared memory parallel environment was an initial design
- goal, the implementation of concurrently reentrant code did not significantly
- increase the complexity of the software. Having code with such properties allows
- direct utilization of nearly any shared memory multiprocessor with
- a minimum of system-specific support, namely the RES_ACQUIRE and RES_RELEASE
- semaphore operations, and some mechanism for starting multiple streams
- of execution within the same address space.
- .PP
- Finally, collections of processors connected only by conventional-speed
- network links were examined as an
- environment for making a single tool operate faster using multiprocessing.
- It was shown that
- network distributed computing need not be inefficient or difficult.
- The protocol and dispatching mechanism described in the preceding
- sections has been shown to be very effective at taking the
- computationally intensive task of generating ray-traced images and
- distributing it across multiple processors connected only by a
- communications network. There are a significant number of other
- application programs that could directly utilize the techniques and
- control software implemented in \fBremrt\fR to achieve network distributed
- operation. However, the development and operation of this type of
- program is still a research effort; the technology
- is not properly packaged for widespread, everyday use.
- Furthermore, it is clear that the techniques used in \fBremrt\fR are not
- sufficiently general to be applied to all scientific problems. In
- particular, problems where each ``cell'' has dependencies on some or
- all of the neighboring cells will require different techniques.
- .PP
- Massive proliferation of computers is a trend that is likely to continue
- through the 1980s into the 1990s and beyond. Developing software to
- utilize significant numbers of network connected processors is the
- coming challenge. This paper has presented a strategy that meets this
- challenge, and provides a simple, powerful, and efficient method for
- distributing a significant family of scientific analysis codes across
- multiple computers.
- .SH
- Acknowledgements
- .PP
- The author would like to thank
- Dr. Paul Deitz for providing his unflagging support and encouragement,
- Phil Dykstra, Paul Stay, Gary Moss, Chuck Kennedy, and Bob Reschly
- for the long hours as we designed and built this software,
- and Dr. Dave Rogers for once again
- persuading me to write it all down.
- .PP
- The following strings which have been included in this paper
- are known to enjoy protection as trademarks;
- the trademark ownership is acknowledged in the table below.
- .TS
- center;
- l l.
- Trademark Trademark Owner
- _
- Cray Cray Research, Inc
- Ethernet Xerox Corporation
- FX/8 Alliant Computer Systems Corporation
- IBM 370 International Business Machines Corporation
- Macintosh Apple Computer, Inc
- MacPaint Apple Computer, Inc
- NFS Sun Microsystems, Inc
- PowerNode Gould, Inc
- ProNet Proteon, Inc
- Sun Workstation Sun Microsystems, Inc
- SunView Sun Microsystems, Inc
- UNIX AT&T Bell Laboratories
- UNICOS Cray Research, Inc
- VAX Digital Equipment Corporation
- VaxStation Digital Equipment Corporation
- X Window System Massachusetts Institute of Technology
- .TE
-