Some of these are configurable via the configure script and it is hoped that in time more and more of them will be.
This document is intended to describe some of these constants so that LAM users can experiment with tuning the MPI library. It also provides some description of the transport layer internals which may help LAM users better understand the behavior and performance they see from the LAM MPI library.
The crossover point from short to long message is configurable in each
transport. See the transport-specific section (tcp, sysv or usysv) for
further information.
Shortcircuit send/receive
Typically when a message is sent or received LAM MPI creates a request
structure, fills it with information about the message, links the
request into a list of messages and calls into a progression "engine" to
effect the data transfer.
When there are no active requests and a blocking (standard mode) send or receive is done, the overhead of creating the request and linking it into the list can be bypassed (shortcircuited) and the progression "engine" called directly to effect the transfer.
This optimization has not been tested as thoroughly as we would like, so a config option is provided to disable it. If you suspect that the optimization may be causing problems, you can disable it with the --without-shortcircuit option to the configure script.
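The difference between the two paths can be illustrated with a toy model. This is only a sketch of the idea described above, not LAM code: the class ToyEngine, its methods and the dictionary used as a "request" are all hypothetical names invented for the illustration.

```python
from collections import deque


class ToyEngine:
    """Hypothetical model of a request list plus progression engine (not LAM's API)."""

    def __init__(self, shortcircuit=True):
        self.shortcircuit = shortcircuit  # stands in for the configure-time switch
        self.active = deque()             # requests currently linked into the list
        self.delivered = []               # (dest, payload) pairs, in transfer order
        self.path_log = []                # which path each blocking send took

    def _transfer(self, dest, payload):
        # Stand-in for the real data transfer.
        self.delivered.append((dest, payload))

    def blocking_send(self, dest, payload):
        if self.shortcircuit and not self.active:
            # Fast path: no request is created or linked into the list.
            self.path_log.append("shortcircuit")
            self._transfer(dest, payload)
            return
        # General path: create a request, link it, then progress the whole list.
        self.active.append({"dest": dest, "payload": payload})
        self.path_log.append("queued")
        while self.active:
            request = self.active.popleft()
            self._transfer(request["dest"], request["payload"])
```

In this model a send issued while another request is outstanding always takes the general path, which is the condition under which the text says the bypass does not apply.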
Tcp transport
The crossover point from short to long message is configurable via the
constant TCPSHORTMSGLEN in rpi.tcp.h. It can also be set from the
configure script via the --with-tcp-short option. The default is 64 KB.
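As a rough illustration of what the crossover point means, the sketch below classifies a message by its length. The function name protocol_for is hypothetical, 64 KB is taken to mean 65536 bytes, and treating a message of exactly the crossover length as short is an assumption; the text does not say on which side of the boundary such a message falls.

```python
# Default crossover from the text, assuming 1 KB = 1024 bytes.
TCP_SHORT_MSG_LEN = 64 * 1024


def protocol_for(length, short_len=TCP_SHORT_MSG_LEN):
    """Return "short" or "long" for a message of `length` bytes.

    Assumption: a message of exactly `short_len` bytes counts as short.
    """
    return "short" if length <= short_len else "long"
```

When experimenting with a different --with-tcp-short value, a helper like this can be used to check which of an application's typical message sizes would change category.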
In the sysv and usysv transports, processes on different nodes communicate via TCP sockets. The crossover point from short to long messages for these communications is configurable via the constant TCPSHORTMSGLEN. It can also be set from the configure script via the --with-tcp-short option. The default is 64 KB.
Processes located on the same node communicate via shared memory. The transport allocates one SYSV shared segment shared by all processes in the task which are on the node. This segment is logically divided into two areas.
The postbox area contains postboxes for short message communication. A postbox is used for one-way communication between two processes. The space allocated per postbox is SHMSHORTMSGLEN + CACHELINESIZE. SHMSHORTMSGLEN is configurable (via the configure option --with-shm-short or by editing rpi.shm.h). It is the crossover point from short to long messages in shared memory communication and the default value is 8 KB.
CACHELINESIZE must be the size of a cache line or a multiple thereof. The default setting is 64 bytes. You shouldn't need to change it. CACHELINESIZE bytes in the postbox are used for a cache-line sized synchronization location.
The size of the postbox area is np * (np - 1) * (SHMSHORTMSGLEN + CACHELINESIZE) bytes, that is, one postbox for each ordered pair among the np processes sharing the segment.
The rest of the shared memory area is used as a global pool from which space for long message transfers is allocated. Allocation from this pool is locked. The default lock mechanism is a SYSV semaphore, but the configure option --with-pthread-lock can be used to change this to a process-shared pthread mutex lock. The size of this pool is configurable via the constant LAM_MPI_SHMPOOLSIZE and by the configure option --with-shm-poolsize.
The configure script will try to determine a size for the pool if none is explicitly specified. You should always check this to see if it is reasonable. Larger values should improve performance, especially when an application passes large messages, but will also increase the system resources used by each task.
The total size of the shared segment allocated is 2 * CACHELINESIZE + LAM_MPI_SHMPOOLSIZE + np * (np - 1) * (SHMSHORTMSGLEN + CACHELINESIZE) bytes. The 2 * CACHELINESIZE bytes are for the global pool lock.
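The size formulas above are easy to evaluate when judging how much shared memory a given configuration will need. The following is a small sketch that uses the default values quoted in the text (64-byte cache line, 8 KB short message length, with 1 KB taken as 1024 bytes). The function names are hypothetical, np is taken to be the number of processes sharing the segment, and the 1 MB pool size in the usage note is only an example value, since the configure script normally chooses it.

```python
CACHELINESIZE = 64          # default from the text, in bytes
SHMSHORTMSGLEN = 8 * 1024   # default from the text, assuming 1 KB = 1024 bytes


def postbox_area_bytes(np, short_len=SHMSHORTMSGLEN, cacheline=CACHELINESIZE):
    """Bytes used by the postbox area: one postbox per ordered pair of processes."""
    return np * (np - 1) * (short_len + cacheline)


def segment_bytes(np, pool_size, short_len=SHMSHORTMSGLEN, cacheline=CACHELINESIZE):
    """Total shared segment size: pool lock + global pool + postbox area."""
    return 2 * cacheline + pool_size + postbox_area_bytes(np, short_len, cacheline)
```

For 4 processes on a node, postbox_area_bytes(4) gives 99072 bytes, and with an example pool of 1048576 bytes segment_bytes(4, 1048576) gives 1147776 bytes. Because the postbox area grows with np * (np - 1), it can come to dominate the segment on nodes running many processes.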
To prevent a single large message transfer from monopolizing the global pool, allocations from it are restricted to a maximum of LAM_MPI_SHMMAXALLOC bytes. Even with this restriction it is possible for the global pool to become temporarily exhausted. In this case the transport will fall back to using the postbox area to transfer the message. Performance will be degraded but the application will progress.
LAM_MPI_SHMMAXALLOC is configurable via the configure option --with-shm-maxalloc or by editing rpi.shm.h.
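The allocation cap and the fallback can be pictured with the sketch below. It is a simplified model, not the transport's actual allocator: the function plan_transfer and its parameters are hypothetical, and it assumes that a capped request either fits in the free pool space entirely or the transfer falls back to a postbox-sized buffer.

```python
SHMSHORTMSGLEN = 8 * 1024  # default postbox payload size from the text, in bytes


def plan_transfer(message_len, pool_free, max_alloc, short_len=SHMSHORTMSGLEN):
    """Pick a buffer for a long message: (source, buffer_bytes).

    The request against the global pool is capped at `max_alloc` bytes
    (the role LAM_MPI_SHMMAXALLOC plays in the text). If the pool cannot
    satisfy the capped request, fall back to the postbox area.
    Assumption: no partial pool allocations are attempted.
    """
    wanted = min(message_len, max_alloc)
    if wanted <= pool_free:
        return ("pool", wanted)
    return ("postbox", short_len)
```

For example, with max_alloc set to 65536, a 1000000-byte message gets a 65536-byte pool buffer while the pool has room, and is moved through an 8192-byte postbox when only 1000 bytes of pool are free, which is why throughput drops in the exhausted case.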
Both transports use a few SYSV semaphores for synchronizing the deallocation of shared structures or for synchronizing access to the shared pool.
The usysv transport should be superior to the sysv transport on multiprocessors. On uniprocessors, which one is better depends on the OS and the means used for processor yielding. On a Linux uniprocessor, for example, using semaphores (sysv transport) appears to be vastly superior to spin-locking.
The use of select to yield can be forced by the --with-select-yield option to the configure script.