.\" indxbib bib
.\" refer -e -p bib a | pic | tbl out1 | /usr/5bin/eqn -Ti300 | \
.\" /usr/5lib/troff -Ti300 -ms - | dimp -t | qpr -q im217-raw
.\"
.so /usr/lib/tmac/tmac.srefs
.EQ
delim $$
.EN
.\"gsize 24
.RP
.TL
Excerpts from
.br
``Workstations, Networking,
.br
Distributed Graphics,
.br
and
.br
Parallel Processing''
.AU
Michael John Muuss
.AI
Leader, Advanced Computer Systems Team
Ballistic Research Laboratory
Aberdeen Proving Ground
Maryland 21005-5066 USA
.AB
.LP
The process of design is iterative in nature;
designs are formulated, analyzed, and improved, until the goals are met.
Modern graphics workstations provide a powerful platform for
performing detailed design and limited analysis of solid model designs,
with faster computers accessed via network links for full resolution analysis.
A broad spectrum of analysis tools exists, and most express their
output in graphical form.
The three main styles of graphics display will be examined, along
with the underlying software mechanisms that support them.
.LP
Several enabling technologies
are required for analysis tools to run on one computer,
with graphical display on another computer, including the
network transparent framebuffer capability.
The importance of portable external data representations will be
reviewed, and several specific
external data representations will be examined in significant detail.
By carefully dividing application software into appropriate tools
and connecting them with UNIX pipes, a measure of parallel processing
can be achieved within one system.
In a tool oriented environment
with machine independent data formats,
network distributed computation can be accomplished.
.LP
The next step is to
make a single tool execute faster using parallel processing.
Many analysis codes are implemented using the ray-tracing paradigm,
which is ideal for parallel execution on
tightly coupled shared-memory multiprocessors and loosely
coupled ensembles of computers.
Both are exploited, using different
mechanisms.
The strategies used for operating
on shared-memory multiprocessors such as the Denelcor HEP, Alliant FX/8, and
Cray X-MP will be presented, along with measured performance data.
.LP
The strategies used for dividing the work among network connected
loosely coupled processors (each of which may themselves be a parallel
processor)
are presented, including details
of the dispatching algorithm, and the design of the distribution protocol.
The performance issues of this type of parallel processing will be presented,
including a
set of measured speeds on a variety of hardware.
.AE
.RT
.\" Make troff use nroff style bracketed references
.ds [. " [
.ds .] ]
.NH 1
(Early Chapters Omitted)
.NH 1
SHARED-MEMORY PARALLEL PROCESSING
.PP
In the preceding section, we have seen how
different processors
can be harnessed to achieve a single goal.
The discussion so far has focused on using multiple processors
(a) within a single, multi-CPU system, through the use of UNIX pipes,
and (b) by distributing different tools in a pipeline to different
machines on the network.
This section extends the investigation into parallel processing
one level further, to harness the power of multiple processor
machines to make a single application run faster.
For the purposes of this discussion, the application to be
parallelized will be a ray-tracing program, but the techniques
developed here are quite general.
.NH 2
The Need for Speed
.PP
Images created using ray-tracing have a reputation for consuming large
quantities of computer time. For complex models,
10 to 20 hours of processor time to render a single frame
on a DEC VAX-11/780 class machine is not uncommon.
Using the ray-tracing paradigm for engineering analysis
.[
Muuss Understanding Solid Models
.]
often requires many times more processing than rendering
a view of the model. Examples of such engineering analyses
include the predictive calculation of radar cross-sections,
heat flow, and bi-static laser reflectivity. For models of
real-world geometry, running these analyses approaches
the limits of practical execution times, even with modern supercomputers.
There are three main strategies that are being employed to
attempt to decrease the amount of elapsed time it takes to
ray-trace a particular scene:
.IP 1)
Advances in algorithms for ray-tracing.
Newer techniques in partitioning space
.[
Kaplan Space-Tracing
.]
and in taking advantage of ray-to-ray coherence
.[
Arvo Ray Classification
.]
promise to continue to yield algorithms that perform fewer and fewer
ray/object intersections that do not contribute to the final result.
Significant work remains to be done in this area, and an order of
magnitude performance gain remains to be realized. However, there
is a limit to the gains that can be made in this area.
.IP 2)
Acquiring faster processors. A trivial method for decreasing the elapsed
time to run a program is to purchase a faster computer. However, even
the fastest general-purpose computers such as the Cray X-MP and Cray-2
do not execute fast enough to permit practical analysis of all
real-world models in appropriate detail. Furthermore, the speed of light
provides an upper bound on the fastest computer that can be built out of
modern integrated circuits; this is already a significant
factor in the Cray X-MP and Cray-2 processors, which operate with 8.5 ns
and 4.3 ns clock periods respectively.
.IP 3)
Using multiple processors to solve a single problem. By engaging the
resources of multiple processors to work on a single problem, the
speed-of-light limit can be circumvented. However, the price is that
explicit attention must be paid to the distribution of data to the
various processors, synchronization of the computations, and collection
of the results.
.PP
Parallel processing is still a relatively young art, and presently there
is only limited support available for
the automatic parallelization of existing
code, with newer vendors like Alliant leading the crowd.
For now, there are few general techniques
for taking programs intended
for serial operation on a single processor, and automatically adapting
them for operation on multiple processors.
.[
Ohr Minisupercomputers Speed
.]
The \fBWorm\fR program developed at Xerox PARC
.[
Shoch Worm Distributed Computation
.]
is one of the earliest known network image-rendering applications.
More recently at Xerox PARC, Frank Crow has attempted to distribute
the rendering of a single image across multiple processors,
.[
Crow Distributed Execution Work in Progress
.]
but discovered that communication overhead and synchronization problems
limited parallelism to about 30% of the available processing power.
A good summary of work to date has been collected by Peterson.
.[
Peterson Distributed Computation
.]
.PP
Ray-tracing analysis of a model has the very nice property that
the computations for each ray/model intersection are entirely
independent of other ray/model intersection calculations.
Therefore, it is easy to see how the calculations for each ray
can be performed by separate, independent processors. The
underlying assumption is that each processor has read-only access
to the entire model database.
While it would be possible to partition the ray-tracing algorithm
in such a way as to require only a portion of the model database
being resident in each processor, this would significantly increase
the complexity of the implementation as well as the
amount of synchronization and control traffic needed.
Such a partitioning has therefore
not yet been seriously attempted.
.PP
It is the purpose of the research reported in the rest of this paper to
explore the performance limits of parallel operation of ray-tracing
algorithms where available processor memory is not a limitation.
While it is not expected that this research will result
in a general purpose technique for distributing arbitrary programs
across multiple processors, the issues of the control and distribution of work
and providing reliable results in a potentially unreliable system
are quite general. The techniques used here are likely to be
applicable to a large set of other applications.
.NH 2
Raytracing Background
.PP
The origins of modern ray-tracing come from work at MAGI
under contract to BRL, initiated in the early 1960s.
The initial results were reported by MAGI
.[
geometric description technique MAGI
.]
in 1967.
Extensions to the early developments were undertaken by a
DoD Joint Technical Coordinating Group effort, resulting in
publications in 1970
.[
MAGIC User Manual
.]
and 1971.
.[
MAGIC Analyst Manual
.]
A detailed presentation of the
fundamental analysis and implementation of the ray-tracing algorithm
can be found in these two documents.
Also see
.[
Appel shading machine renderings solids
.]
.PP
More recently, interest in ray-tracing developed in the academic
community, with Kay's
.[
Kay Ray Tracing 1979
.]
thesis in 1979
being a notable early work.
One of the central papers in the ray-tracing literature is
the work of Whitted.
.[
Whitted Improved Illumination Model
.]
Model sampling techniques can be improved to provide substantially
more realistic images by using the ``Distributed Ray Tracing'' strategy.
.[
Cook Carpenter Distributed Ray Tracing
.]
For an excellent, concise discussion of ray-tracing,
consult pages 363-381 of Rogers.
.[
Rogers Procedural Elements
.]
.PP
There are several implementation strategies for interrogating
the model by computing ray/geometry intersections.
The traditional approach has been batch-oriented,
with the user defining a set of ``viewing angles'',
turning loose a big batch job to compute all the ray intersections,
and then post-processing all the ray data into some meaningful form.
However, the major drawback of this approach is that the application
has no dynamic control over ray paths, making another batch run
necessary for each level of reflection, etc.
.PP
In order to be successful, applications need: (1) dynamic control of ray
paths, to naturally implement reflection, refraction, and fragmentation into
multiple subsidiary rays, and (2) the ability to fire rays in arbitrary
directions from arbitrary points.
Nearly all non-batch ray-tracing implementations have a specific closely
coupled application (typically a model of illumination), which
allows efficient and effective control of the ray paths.
However, the most flexible approach is to implement the ray-tracing
capability as a general-purpose library, to make the functionality
available to any application as needed, and
this is the approach taken in the BRL CAD Package.
.[
Muuss CAD Package Release 1.21
.]
The ray-tracing library is called \fBlibrt\fR, while the ray-tracing
application of interest here (an optical spectrum lighting model)
is called \fBrt\fR.
.NH 2
The Structure of librt
.PP
In order to give all applications dynamic control over
the ray paths, and to allow the rays to be fired in arbitrary directions
from arbitrary points, BRL has implemented its third generation
ray-tracing capability as a set of library routines.
\fBlibrt\fR exists to allow application programs to
intersect rays with model geometry. There are four parts to the
interface: three preparation routines and the actual ray-tracing routine.
The first routine that must be called is
rt_dirbuild(), which opens the database file and builds the
in-core database table of contents.
The second routine to be called is rt_gettree(), which
adds a database sub-tree to the active model space.
rt_gettree() can be called multiple times
to load different parts of the database
into the active model space.
The third routine is rt_prep(), which
computes the space partitioning data structures and does other
initialization chores.
Calling this routine is optional,
as it will be called by rt_shootray() if needed.
rt_prep() is provided as a separate routine to
allow independent timing of the preparation and ray-tracing phases of
applications.
.PP
To compute the intersection of a ray with the geometry in the active
model space, the application must call rt_shootray() once for each
ray. Ray-path selection for perspective, reflection, refraction, etc.,
is entirely determined by the application program. The only parameter
to rt_shootray() is a \fBlibrt\fR ``application'' structure, which
contains five major elements: the vector a_ray.r_pt which
is the starting point of the ray to be fired, the vector a_ray.r_dir
which is the unit-length direction vector of the ray,
the pointer *a_hit() which is the address of an application-provided
routine to call when the ray intersects the model geometry, the pointer
*a_miss() which is the address of an application-provided routine to
call when the ray does not hit any geometry, the flag a_onehit which is
set non-zero to stop ray-tracing as soon as the ray has intersected at
least one piece of geometry (useful for lighting models), plus various
locations for each application to store state (recursion level, colors,
etc). Note that the integer returned from the application-provided
a_hit()/a_miss() routine is the formal return of the function
rt_shootray(). The rt_shootray() function is prepared for full recursion
so that the a_hit()/a_miss() routines can themselves fire additional
rays by calling rt_shootray() recursively before deciding their own
return value.
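.PP
To make the five elements concrete, the structure can be sketched
as follows; this is an illustrative paraphrase of the description above,
not the actual declaration from the \fBlibrt\fR header file:
.sp .5
.nf
/* Sketch of the ``application'' structure elements described above */
struct application {
	struct xray	a_ray;		/* a_ray.r_pt: ray start point */
					/* a_ray.r_dir: unit direction */
	int		(*a_hit)();	/* called when geometry is hit */
	int		(*a_miss)();	/* called when nothing is hit */
	int		a_onehit;	/* non-zero: stop after first hit */
	/* ...application-specific state: recursion level, colors, etc... */
};
.fi
.sp .5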
.PP
In addition, the function rt_shootray() is serially and concurrently
reentrant, using only registers, local variables allocated on the stack, and
dynamic memory allocated with rt_malloc().
The rt_malloc() function serializes calls to \fBmalloc\fR(3).
By having the ray-tracing library fully prepared to run
in parallel with other instances of itself in the same address space,
applications may take full advantage of parallel hardware capabilities,
where such capabilities exist.
.NH 2
A Sample Ray-Tracing Program
.PP
A simple application program that fires one ray at a model and prints
the result is included below, to demonstrate the simplicity of the
interface to \fBlibrt\fR.
.LP
.sp .5
.nf
.ne 2i
#include <brlcad/raytrace.h>

struct application ap;

int
hit_geom( app, part )
struct application *app;
struct partition *part;
{
	/* First partition in the list is the closest intersection */
	printf("Hit %s\n", part->pt_forw->pt_regionp->reg_name);
	return 1;
}

int
miss_geom( app )
struct application *app;
{
	printf("Missed\n");
	return 0;
}

main() {
	rt_dirbuild("model.g");		/* build table of contents */
	rt_gettree("car");		/* load sub-tree into model space */
	rt_prep();			/* optional: space partitioning */

	/* Fire one ray from (100,0,0) in the -X direction */
	VSET( ap.a_ray.r_pt, 100, 0, 0 );
	VSET( ap.a_ray.r_dir, -1, 0, 0 );
	ap.a_hit = hit_geom;
	ap.a_miss = miss_geom;
	ap.a_onehit = 1;		/* stop at first hit */
	rt_shootray( &ap );
}
.fi
.NH 2
Normal Operation: Serial Execution
.PP
When running the \fBrt\fR program on a serial processor,
the code of interest is the top of the subroutine hierarchy.
The function
main()
first calls
get_args()
to parse any command line options, then calls
rt_dirbuild()
to acquaint \fBlibrt\fR with the model database, and
view_init()
to initialize the application
(in this case a lighting model, which may call
mlib_init()
to initialize the material-property library). Finally,
rt_gettree()
is called repeatedly to load the model treetops.
For each frame to be produced,
the viewing parameters are processed, and
do_frame()
is called.
.PP
Within
do_frame(),
per-frame initialization is handled by calling
rt_prep(),
mlib_setup(),
grid_setup(),
and
view_2init().
Then,
do_run()
is called with the linear pixel indices of the start and end locations in
the image; typically these values are zero and width*length-1,
except for the ensemble computer case.
In the non-parallel cases,
the do_run()
routine initializes the global variables
cur_pixel and last_pixel, and calls
worker().
At the end of the frame,
view_end()
is called to handle any final output, and print some statistics.
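.PP
In the serial case, the sequence just described reduces to a
sketch as simple as the following (routine and variable names
as in the text, details simplified):
.sp .5
.nf
do_run( start, end )
int start, end;
{
	cur_pixel = start;	/* index of first pixel to compute */
	last_pixel = end;	/* index of last pixel to compute */
	worker();		/* computes all pixels, one at a time */
}
.fi
.sp .5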
.PP
The worker() routine
obtains the index of the next pixel that needs to be computed by
incrementing cur_pixel, and calls
rt_shootray()
to interrogate the model geometry.
view_pixel()
is called to output the results for that pixel.
worker()
loops, computing one pixel at a time, until
\fIcur_pixel > last_pixel\fR,
after which it returns.
.PP
When
rt_shootray()
hits some geometry, it calls the a_hit() routine listed in the application
structure to determine the final color of the pixel.
In this case, colorview() is called. colorview() uses view_shade()
to do the actual computation.
Depending on the properties of the material hit and the stack of shaders
that are being used, various material-specific renderers may be called,
followed by a call to
rr_render()
if reflection or refraction is needed. Any of these routines may spawn
multiple rays, and/or recurse on
colorview().
.NH 2
Parallel Operation on Shared-Memory Machines
.PP
By capitalizing on the serial and concurrent
reentrancy of the \fBlibrt\fR routines, it is very easy to take advantage
of shared memory machines where it is possible to initiate multiple
``streams of execution'' or ``threads''
within the address space of a single process.
To ensure that global variables are manipulated
by only one instruction stream at a time,
all such shared modifications are enclosed in critical sections.
For each type of processor, it is necessary to implement the routines
RES_ACQUIRE()
and
RES_RELEASE()
to provide system-wide semaphore operations.
When a processor acquires a resource, any other processor needing that
same resource will wait until it is released, at which time exactly
one of the waiting processors will acquire it.
.PP
In order to minimize contention between processors
over the critical sections of code, all critical sections are kept
as short as possible: typically only a few lines of code.
Furthermore, there are different semaphores for each type of resource
accessed in critical sections.
.I res_syscall
is used to interlock all UNIX
system calls and some library routines,
such as
\fBwrite\fR(2),
\fBmalloc\fR(3),
\fBprintf\fR(3),
etc.
.I res_worker
is used by the function
worker()
to serialize access to the variable
cur_pixel,
which contains the index of the next pixel to be computed.
.I res_results
is used by the function
view_pixel()
to serialize access to the result buffer. This is necessary because
few processors have hardware multi-processor interlocking on
byte operations within the same word.
.I res_model
is used by the \fBlibspl\fR spline library routines to serialize operations
which cause the model to be further refined during the raytracing process,
so that data structures remain consistent.
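.PP
For example, the rt_malloc() function mentioned earlier amounts to
a sketch like the following, with
.I res_syscall
guarding the non-reentrant C library
(assuming the semaphore lives in the rt_g structure, as
res_worker does):
.sp .5
.nf
/* Sketch: serialize calls to malloc(3) */
char *
rt_malloc( cnt )
unsigned int cnt;
{
	register char *ptr;

	RES_ACQUIRE( &rt_g.res_syscall );
	ptr = malloc( cnt );
	RES_RELEASE( &rt_g.res_syscall );
	return ptr;
}
.fi
.sp .5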
.PP
Application of the usual client-server model of computing would
suggest that one stream of execution would be dedicated to dispatching the
next task, while the rest of the streams of execution would be used for
ray-tracing computations. However, in this case, the dispatching operation
is trivial and a ``self-dispatching'' algorithm is used,
with a critical section
used to protect the shared variable
cur_pixel.
The real purpose of the function
do_run()
is to perform whatever machine-specific operation is required to
initiate
.I npsw
streams of execution within the address space of the \fBrt\fR program, and
then to have each stream call the function
worker(),
each with appropriate local stack space.
.PP
Each worker() function will loop until no more pixels remain,
taking the next available pixel index.
For each pass through the loop, RES_ACQUIRE(res_worker) is
used to acquire the semaphore, the index of the next
pixel to be computed, \fIcur_pixel\fR,
is read and incremented, and then the semaphore is released,
i.e.,
.nf
.sp .5
worker() {
	int my_index;
	struct application a;

	while(1) {
		/* Critical section: claim the next pixel index */
		RES_ACQUIRE( &rt_g.res_worker );
		my_index = cur_pixel++;
		RES_RELEASE( &rt_g.res_worker );
		if( my_index > last_pixel )
			break;
		a.a_x = my_index%width;
		a.a_y = my_index/width;
		/* ...compute ray parameters... */
		rt_shootray( &a );
	}
}
.fi
.PP
On the Denelcor HEP H-1000 each word of memory has a full/empty tag bit
in addition to 64 data bits.
RES_ACQUIRE is implemented using the
Daread()
primitive, which uses the hardware capability to wait until the
semaphore word is full, then read it, and mark it as empty.
RES_RELEASE is implemented using the
Daset()
primitive, which marks the word as full.
do_run()
starts additional streams of execution using the
Dcreate(worker)
primitive, which creates another stream which immediately calls the
worker()
function.
.PP
On the Alliant FX/8, RES_ACQUIRE is implemented using the hardware
test-and-set (TAS) instruction, which tests a location for zero.
As an atomic operation,
if the location is zero, TAS sets it non-zero and
sets the condition codes appropriately. RES_ACQUIRE embeds this
test-and-set instruction in a polling loop which waits to acquire
the resource.
RES_RELEASE just zeros the
semaphore word. Parallel execution is achieved by using the
hardware capability to spread a loop across multiple processors,
so a simple loop from 0 to 7 which calls
worker()
is executed in hardware concurrent mode. Each concurrent instance
of worker() is given a separate stack area in the ``cactus stack''.
.PP
On the Cray X-MP and Cray-2, the Cray multi-tasking library is used.
RES_ACQUIRE maps into LOCKON, and RES_RELEASE maps into LOCKOFF,
while
do_run()
just calls
TSKSTART(worker)
to obtain extra workers.
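.PP
The machine-specific bindings just described can be summarized in
the following sketch; the primitive names are those given in the
text, while the surrounding glue (including the tas() wrapper for
the TAS instruction) is hypothetical:
.sp .5
.nf
#ifdef HEP
#define RES_ACQUIRE(p)	Daread(p)	/* wait until full; read; mark empty */
#define RES_RELEASE(p)	Daset(p)	/* mark the word full */
#endif

#ifdef CRAY
#define RES_ACQUIRE(p)	LOCKON(p)
#define RES_RELEASE(p)	LOCKOFF(p)
#endif

#ifdef alliant
RES_ACQUIRE(p)
register long *p;
{
	/* tas() returns non-zero if the word was already non-zero */
	while( tas(p) )
		continue;	/* poll until the resource is free */
}
#endif
.fi
.sp .5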
.NH 2
Performance Measurements
.PP
An important part of the BRL CAD Package is a set of five benchmark model
databases and associated viewing parameters, which permit relative
performance comparisons of different computers and configurations to be
made using a significant production program as the basis of comparison.
For the purposes of this paper, just the ``Moss'' database will be used for
comparison. Since this benchmark generates pixels the fastest, it will
place the greatest demands on any parallel processing scheme. The
benchmark image is computed at 512x512 resolution.
.PP
The relative performance figures for running \fBrt\fR in the parallel mode
with Release 1.20 of the BRL CAD Package are presented below. The
Alliant FX/8 machine was brl-vector.arpa, configured with 8
Computational Elements (CEs), 6 68012 Interactive Processors (IPs), 32
Mbytes of main memory, and was running Concentrix 2.0, a port of 4.2 BSD
UNIX.
The Cray X-MP/48 machine was brl-patton.arpa, serial number 213, with
4 processors, 8 Mwords of main memory, with a clock period
of 8.5 ns, and UNICOS 2.0, a port of System V UNIX.
Unfortunately, no comprehensive results are available for the
Denelcor HEP, the only other parallel computer known to have run this code.
.TS
center box;
c s s s s
c c c c c
n n n n n.
Parallel \fBrt\fR Speedup -vs- # of Processors
_
# Processors	FX/8	(eff)	X-MP/48	(eff)
_
1	1.00	100%	1.00	100%
2	1.84	92.0%	1.99	99.5%
3	2.79	93.0%	2.96	98.7%
4	3.68	92.0%	3.86	96.5%
5	4.80	96.0%
6	5.70	95.0%
7	6.50	92.9%
8	7.46	93.3%
.TE
.PP
The multiple-processor performance of \fBrt\fR increases nearly
linearly for shared memory machines
with small collections of processors.
The slight jump in efficiency when the fifth Alliant processor is added
comes from the fact that the first four processors share one cache
memory, while the second four share a second cache memory.
To date, \fBrt\fR holds the record for the best achieved speedup for parallel
processing on both the Cray X-MP/48 and the Alliant.
Measurements on the HEP, before
it was dismantled, indicated that near-linear improvements continued through
128 streams of execution. This performance is due to the fact that
the critical sections are very small, typically just a few lines of code,
and that they account for an insignificant portion of the computation time.
When \fBrt\fR is run in parallel and the number of processors is increased,
the limit to overall performance will be determined
by the total bandwidth of the shared memory, and by memory conflicts over
popular regions of code and data.
.NH 1
DISTRIBUTED COMPUTATION ON LOOSELY-COUPLED ENSEMBLES OF PROCESSORS
.PP
The basic assumption of this design is that network bandwidth is
modest, so that the number of bytes and packets of overhead should not exceed
the number of bytes and packets of results.
The natural implementation would be to provide
a remote procedure call (RPC) interface to
rt_shootray(), so that when additional subsidiary rays are needed,
more processors could potentially be utilized. However, measurements
of this approach on VAX, Gould, and Alliant computers indicates that
the system-call and communications overhead is comparable to the
processing time for one ray/model intersection calculation.
This much overhead rules out the RPC-per-ray interface for practical
implementations.
On some tightly coupled
ensemble computers, there might be little penalty for such an approach,
but in general, some larger unit of work must be exchanged.
.PP
It was not the intention of the author to develop another protocol
for remote file access, so the issue of distributing the model database
to the \fBrtsrv\fR server machines is handled outside of the context of
the \fBremrt\fR and \fBrtsrv\fR software.
In decreasing order of preference, the
methods for model database distribution that are currently used
are Sun NFS, Berkeley \fBrdist\fR, Berkeley \fBrcp\fR,
and ordinary DARPA \fBftp\fR.
Note that the binary databases need to be converted to a portable
format before they are transmitted across the network, because \fBrtsrv\fR
runs on a wide variety of processor types.
Except for the model databases
and the executable code of the \fBrtsrv\fR server process itself,
no file storage is used on any of the server machines.
.NH 2
Distribution of Work
.PP
The approach used in \fBremrt\fR involves a single dispatcher
process, which communicates with an arbitrary number of server
processes.
Work is assigned in groups of scanlines.
As each server finishes a scanline,
the results are sent back to the dispatcher,
where they are stored.
Completed scanlines are removed from the list of scanlines to be done
and from the list of scanlines currently assigned to that server.
Different servers may be working on entirely different frames.
Before a server is assigned scanlines from a new frame, it is sent
a new set of options and viewpoint information.
.PP
The underlying communications layer used
is the package (PKG) protocol, provided by the \fBlibpkg\fR library,
so that all communications are known to be
reliable, and communication disruptions are noticed.
Whenever the
dispatcher is notified by the \fBlibpkg\fR routines that contact with a server
has been lost, all unfinished scanlines assigned to that server will be
requeued at the head of the ``work to do'' queue, so that they will be
assigned to the very next available server, allowing tardy scanlines to
be finished quickly.
.NH 2
Distribution Protocol
.PP
When a server process \fBrtsrv\fR is started, the host name of the machine
running the dispatcher process is given as a command line argument.
The server process can be started from a command in the dispatcher
\fBremrt\fR, which uses \fBsystem\fR(3)
to run the \fBrsh\fR program, or directly via some other mechanism.
This avoids the need to register the \fBrtsrv\fR program as a system network
daemon and transfers issues of access control, permissions, and
accounting onto other, more appropriate tools. Initially, the \fBrtsrv\fR
server initiates a PKG connection to the dispatcher process and then
enters a loop reading commands from the dispatcher. Some commands
generate no response at all, some generate one response message, and
some generate multiple response messages. However, note that the server
does not expect to receive any additional messages from the dispatcher
until after it has finished processing a request, so that requests do
not have to be buffered in the server. While this simplifies the code,
it has some performance implications, which are discussed later.
.PP
In the first stage, the message received must be of type MSG_START, with
string parameters specifying the pathname of the model database and the
names of the desired treetops. If all goes well, the server responds
with a MSG_START message, otherwise diagnostics are returned as string
parameters to a MSG_PRINT message and the server exits.
.PP
In the second
stage, the message received must be of type MSG_OPTIONS or MSG_MATRIX.
MSG_OPTIONS specifies the image size and shape, hypersampling, stereo
viewing, perspective -vs- ortho view, and control of randomization
effects (the ``benchmark'' flag), using the familiar UNIX command line
option format. MSG_MATRIX contains the 16 ASCII floating point numbers
for the 4x4 homogeneous transformation matrix which represents the
desired view.
.PP
In the third stage, the server waits for messages of type MSG_LINES,
which specify the starting and ending scanline to be processed.
As each scanline is completed, it is immediately sent back to the
dispatcher process to minimize the amount of computation that could
be lost in case of server failure or communications outage.
Each scanline is returned in a message of type MSG_PIXELS. The first
two bytes of that message contain the scanline number in
network order 16-bit binary.
Following that are the 3*width bytes of RGB
data representing the scanline.
When all the scanlines specified in the MSG_LINES command are processed,
the server again waits for another message, either another MSG_LINES
command or a MSG_OPTIONS/MSG_MATRIX command to specify a new view.
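.PP
The MSG_PIXELS encoding just described is simple enough to sketch
directly; the buffer and variable names here are hypothetical:
.sp .5
.nf
/* Sketch: building one MSG_PIXELS payload */
unsigned char buf[2+3*MAX_WIDTH];

buf[0] = (line>>8) & 0xFF;	/* scanline number, 16-bit, */
buf[1] = line & 0xFF;		/* network (big-endian) order */
bcopy( rgb, &buf[2], 3*width );	/* RGB data for the scanline */
.fi
.sp .5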
.PP
At any time, a MSG_RESTART message can be received by the server, which
indicates that it should close all its files and
immediately re-\fBexec\fR(2) itself, either to prepare for processing
an entirely new model, or as an error recovery aid.
A MSG_LOGLVL message can be received at any time, to enable and disable
the issuing of MSG_PRINT output.
A MSG_END message suggests that the server should commit suicide,
courteously.
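.PP
The overall shape of the server is thus a simple read-dispatch
loop; in this sketch the message structure and helper routines
are hypothetical (the actual \fBrtsrv\fR dispatches packages
through a \fBlibpkg\fR handler table):
.sp .5
.nf
for(;;)  {
	msg = get_next_package();	/* blocks; delivery is reliable */
	switch( msg->type )  {
	case MSG_START:
		do_start(msg);		/* open db, load treetops */
		break;
	case MSG_OPTIONS:
		do_options(msg);	/* size, hypersampling, etc */
		break;
	case MSG_MATRIX:
		do_matrix(msg);		/* 4x4 viewing transformation */
		break;
	case MSG_LINES:
		do_lines(msg);		/* compute scanlines, send MSG_PIXELS */
		break;
	case MSG_RESTART:
		do_restart();		/* close files, re-exec(2) self */
		break;
	case MSG_END:
		exit(0);		/* courteous suicide */
	}
}
.fi
.sp .5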
.NH 2
Dispatching Algorithm
.PP
The dispatching (scheduling) algorithm revolves around two main lists,
the first being a list of currently connected servers and
the second being a list of frames still to be done. For each unfinished
frame, a list of scanlines remaining to be done is maintained.
For each server, a list of the currently assigned scanlines is kept.
Whenever a server returns a scanline, it is removed from the list of
scanlines assigned to that server, stored in the output image,
and also in the optional attached framebuffer.
(It can be quite entertaining to watch the scanlines racing up the screen,
especially when using processors of significantly different speeds).
If the arrival of this scanline completes a frame, then the
frame is written to disk on the dispatcher machine, timing data is computed,
and that frame
is removed from the list of work to be done.
.PP
When a server finishes the last scanline of its assignment and more
work remains to be done, the list of unfinished frames is searched and
the next available increment of work is assigned. Work is assigned in
blocks of consecutive scanlines, up to a per-server maximum assignment size.
The block of scanlines is recorded as the server's new assignment and
is removed from the list of work to be done.
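.PP
The bookkeeping just described amounts to two main lists with
per-frame and per-server sublists; a sketch of plausible
declarations (all names hypothetical) follows:
.sp .5
.nf
/* Sketch of the dispatcher's principal data structures */
struct frame {
	struct frame	*fr_forw;	/* next unfinished frame */
	struct list	fr_todo;	/* scanlines still to be done */
	char		*fr_picture;	/* image being assembled */
};
struct server {
	struct server	*sr_forw;	/* next connected server */
	struct list	sr_work;	/* scanlines currently assigned */
	struct frame	*sr_curframe;	/* frame being worked on */
};
.fi
.sp .5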
.NH 2
Reliability Issues
.PP
If the \fBlibpkg\fR communications layer loses contact with a server machine,
or if \fBremrt\fR is commanded to drop a server, then the scanlines remaining
in the assignment are requeued at the head of the list of scanlines remaining
for that frame. They are placed at the head of the list so that the first
available server will finish the tardy work, even if it had gone ahead to
work on a subsequent frame.
.PP
Presently, adding and dropping server machines is a manual (or script
driven) operation. It would be desirable to develop a separate
machine-independent network mechanism
that \fBremrt\fR could use to inquire about the current loading and availability
of server machines, but this has not been done.
This would permit periodic status requests to be made,
and automatic reacquisition of eligible server machines
to be attempted.
Peterson's Distrib
.[
Peterson Distributed Computation
.]
System incorporates this as a built-in part of the distributed computing
framework, but it seems that using an independent transaction-based
facility such as Pistritto's Host Monitoring Protocol (HMP) facility
.[
Natalie Muuss BRL VAX UNIX Manual
.]
would be a more general solution.
.PP
If the dispatcher fails, all frames that have not been completed
are lost; on restart,
execution resumes at the beginning of the first uncompleted frame.
By carefully choosing a machine that has excellent reliability to
run the dispatcher on, the issue of dispatcher failure can be largely
avoided. However, typically no more than two frames will be lost,
minimizing the impact. For frames that take extremely long times to
compute, it would be reasonable to extend the dispatcher to snapshot the work
queues and partially assembled frames in a disk file, to permit operation
to resume from the last ``checkpoint''.
.NH 2
Distributed remrt Performance
.PP
Ten identical Sun-3/50 systems were used
to test the performance of \fBremrt\fR.
All had 68881 floating point units and 4 Mbytes of memory,
and all were in normal timesharing mode, unused except for
running the tests and the slight overhead
imposed by /etc/update, \fBrwhod\fR, etc.
To provide a baseline performance figure for comparison, the
benchmark image was computed in the normal way using \fBrt\fR, to
avoid any overhead which might be introduced by \fBremrt\fR.
The elapsed time to execute the ray-tracing portion of the benchmark
was 2639 seconds; the preparation phase was not included, but amounted
to only a few seconds.
.TS
center box;
c s s s s s s
l c s | c s | c c
l c c | c c | c c
l n n | n n | n n.
\fBremrt\fR Speedup -vs- # of Processors
_
	Ratios	Elapsed Seconds
# CPUs	Theory	Sun-3/50	Theory	Sun-3/50	Total Speedup	Efficiency
_
1	1.0000	1.0072	2639.0	2658	0.993	99.3%
2	0.5000	0.5119	1319.5	1351	1.953	97.7%
3	0.3333	0.3357	879.6	886	2.979	99.3%
4	0.2500	0.2524	659.7	666	3.949	98.7%
5	0.2000	0.2027	527.8	535	4.916	98.3%
_
6	0.1666	0.1686	429.8	445	5.910	98.5%
7	0.1429	0.1470	377.0	388	6.778	96.8%
8	0.1250	0.1266	329.9	334	7.874	98.4%
9	0.1111	0.1133	293.2	299	8.796	97.7%
10	0.1000	0.1019	263.9	269	9.777	97.8%
.TE
.PP
The ``speedup'' figure of 0.993 for 1 CPU shows the loss of performance
of 0.7% introduced by the overhead of
the \fBremrt\fR to \fBrtsrv\fR communications,
versus the non-distributed \fBrt\fR performance figure. The primary result of note
is that the speedup of the \fBremrt\fR network distributed application is
very close to the theoretical maximum speedup, with a total efficiency
of 97.8% for the ten Sun case!
The very slight loss of performance noticed (2.23%)
is due mostly to ``new assignment latency'',
discussed further below. Even so, it is worth noting that the speedup
achieved by adding processors with \fBremrt\fR was even better than the
performance achieved by adding processors in parallel mode with \fBrt\fR.
This effect is due mostly to the lack of memory and semaphore contention
between the \fBremrt\fR machines.
.PP
Unfortunately, time did not permit configuring and testing
multiple Alliants running \fBrtsrv\fR in full parallel mode,
although such operation is supported by \fBrtsrv\fR.
.PP
When \fBremrt\fR is actually being used for producing images, many different
types of processors can be used together. The aggregate performance of
all the available machines on a campus network is truly awesome,
especially when a Cray or two is included! Even in this
case, the network bandwidth required does not exceed the capacity of an
Ethernet (yet). The bandwidth requirements are sufficiently small that
it is practical to run many \fBrtsrv\fR processes distributed over the
ARPANET/MILNET. On one such occasion in early 1986, 13 Gould PN9080
machines were used all over the east coast to finish some images for a
publication deadline.
.NH 2
Performance Issues
.PP
The policy of making work assignments in terms of multiple adjacent
scanlines
reduces
the processing requirements of the dispatcher and also improves the
efficiency of the servers.
As a server finishes a scanline, it can give the scanline to
the local operating system to send to the dispatcher machine,
while the server
continues with the computation, allowing the transmission to be
overlapped with more computation. When gateways and wide-area networks
are involved (with their accompanying increase in latency and packet loss),
this is an important consideration. In the current
implementation, assignments are always blocks of three scanlines
because there is no general way for the \fBrtsrv\fR process to know what kind
of machine it is running on and how fast it is likely to go. Clearly,
it would be worthwhile to assign larger blocks of scanlines to the
faster processors so as to minimize idle time and control traffic overhead.
Seemingly the best way to determine this would be to measure the rate of
scanline completion and dynamically adjust the allocation size.
This is not currently implemented.
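.PP
One plausible form of such a dynamic adjustment, sketched below
with hypothetical names, would scale each server's next assignment
by its measured scanline completion rate:
.sp .5
.nf
/* Sketch: adapt assignment size to measured server speed */
int
assignment_size( sp )
struct server *sp;
{
	register int	nlines;

	/* sr_rate: scanlines/second over recent assignments */
	nlines = sp->sr_rate * TARGET_SECONDS;
	if( nlines < MIN_ASSIGNMENT )  nlines = MIN_ASSIGNMENT;
	if( nlines > MAX_ASSIGNMENT )  nlines = MAX_ASSIGNMENT;
	return nlines;
}
.fi
.sp .5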
.PP
By increasing the scanline block assignment size for the faster
processors, the amount of time the server spends waiting for a new
assignment (termed ``new assignment latency'') will be diminished, but
not eliminated. Because the current design assumes that the server will
not receive another request until the previous request has been fully
processed, no easy solution exists. Extending the server implementation
to buffer at least one additional request would permit this limitation
to be overcome, and the dispatcher would then have the option of sending
a second assignment before the first one had completed, to always keep
the server ``pipeline'' full. For the case of very large numbers of
servers, this pipelining will be important to keep delays in the
dispatcher from affecting performance. In the case of very fast servers,
pipelining will be important in achieving maximum server utilization, by
overcoming network and dispatcher delays.
.PP
To obtain an advantage from the pipeline effect of the multiple scanline
work assignments, it is important that the network implementations
in both the servers and the dispatcher have adequate buffering to
hold an entire scanline (typically 3K bytes). For the dispatcher,
it is a good idea to increase the default TCP receive space (and thus
the receive window size) from 4K bytes to 16K bytes. For the server
machines, it is a good idea to increase the default TCP transmit space
from 4K bytes to 16K bytes. This can be accomplished by modifying
the file /sys/netinet/tcp_usrreq.c to read:
.sp .5
.nf
int tcp_sendspace = 1024*16;
int tcp_recvspace = 1024*16;
.fi
.sp .5
or to make suitable modifications to the binary image of the kernel
using \fIadb\fR(1):
.sp .5
.nf
adb -w -k /vmunix
tcp_sendspace?W 0x4000
tcp_recvspace?W 0x4000
.fi
.sp .5
.PP
The dispatcher process must maintain an active network connection to
each of the server machines. In all systems there is some limit to the
number of open files that a single process may use (symbol NOFILE); in
4.3 BSD UNIX, the limit is 64 open files. For the current implementation,
this places an upper bound on the number of servers that can be used.
As many campus networks have more than 64 machines available at night,
it would be nice if this limit could be eased. One approach is
to increase the limit on the dispatcher machine. Another approach is to
implement a special ``relay server'' to act as a fan-in/fan-out
mechanism, although the additional latency could get to be an issue.
A third approach is to partition the problem at a higher level.
For example, having the east campus do the beginning of a movie,
and the west campus do the end would reduce the open file problem.
Additionally, if gateways are involved, partitioning the problem may be kinder
to the campus network.
.NH 2
Conclusions
.PP
Parallel computing is good.
.PP
This paper has shown how it is possible to implement good graphics
interfaces within the confines of a single uniprocessor machine.
With the adoption of
a ``software tools'' approach when providing data handling capabilities,
it was shown how to transparently take advantage of multi-processor
machines, and thus realize a speed advantage.
Furthermore, the careful selection of file formats permitted
realizing a further speed advantage in the accomplishment of a single task
by utilizing multiple systems located across a network.
.PP
Carrying the theme of increased speed further,
multi-processor systems were examined as a vehicle for making single
image generation tools operate faster.
An important result was to note that
when operation in a shared memory parallel environment was an initial design
goal, the implementation of concurrently reentrant code did not significantly
increase the complexity of the software. Having code with such properties allows
direct utilization of nearly any shared memory multiprocessor with
a minimum of system-specific support, namely the RES_ACQUIRE and RES_RELEASE
semaphore operations, and some mechanism for starting multiple streams
of execution within the same address space.
.PP
Finally, collections of processors connected only by conventional-speed
network links were examined as an environment
for making a single tool operate faster using multiprocessing.
It was shown that
network distributed computing need not be inefficient or difficult.
The protocol and dispatching mechanism described in the preceding
sections has been shown to be very effective at taking the
computationally intensive task of generating ray-traced images and
distributing it across multiple processors connected only by a
communications network. There are a significant number of other
application programs that could directly utilize the techniques and
control software implemented in \fBremrt\fR to achieve network distributed
operation. However, the development and operation of this type of
program is still a research effort; the technology
is not properly packaged for widespread, everyday use.
Furthermore, it is clear that the techniques used in \fBremrt\fR are not
sufficiently general to be applied to all scientific problems. In
particular, problems where each ``cell'' has dependencies on some or
all of the neighboring cells will require different techniques.
.PP
Massive proliferation of computers is a trend that is likely to continue
through the 1980s into the 1990s and beyond. Developing software to
utilize significant numbers of network connected processors is the
coming challenge. This paper has presented a strategy that meets this
challenge, and provides a simple, powerful, and efficient method for
distributing a significant family of scientific analysis codes across
multiple computers.
.SH
Acknowledgements
.PP
The author would like to thank
Dr. Paul Deitz for providing his unflagging support and encouragement,
Phil Dykstra, Paul Stay, Gary Moss, Chuck Kennedy, and Bob Reschly
for the long hours as we designed and built this software,
and Dr. Dave Rogers for once again
persuading me to write it all down.
.PP
The following strings which have been included in this paper
are known to enjoy protection as trademarks;
the trademark ownership is acknowledged in the table below.
.TS
center;
l l.
Trademark	Trademark Owner
_
Cray	Cray Research, Inc
Ethernet	Xerox Corporation
FX/8	Alliant Computer Systems Corporation
IBM 370	International Business Machines Corporation
Macintosh	Apple Computer, Inc
MacPaint	Apple Computer, Inc
NFS	Sun Microsystems, Inc
PowerNode	Gould, Inc
ProNet	Proteon, Inc
Sun Workstation	Sun Microsystems, Inc
SunView	Sun Microsystems, Inc
UNIX	AT&T Bell Laboratories
UNICOS	Cray Research, Inc
VAX	Digital Equipment Corporation
VaxStation	Digital Equipment Corporation
X Window System	Massachusetts Institute of Technology
.TE