LAM 6.2 Beta Tuning

There are various constants defined in the LAM header files which relate to message transfer protocols, shared memory allocation and so on.

Some of these are configurable via the configure script and it is hoped that in time more and more of them will be.

This document is intended to describe some of these constants so that LAM users can experiment with tuning the MPI library. It also provides some description of the transport layer internals which may help LAM users better understand the behavior and performance they see from the LAM MPI library.

Short/long protocol

LAM MPI uses a short/long message protocol. If a message is short it is sent together with a header in one transfer to the destination process. If the message is long then a header (possibly with some data) is sent to the destination and the sending process then waits for an acknowledgment from the receiver before sending the rest of the message data. The receiving process sends the acknowledgment when a matching receive is posted.
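
The choice between the two paths can be sketched in C as follows. This is only an illustration: the names choose_protocol and sender_phases are hypothetical and do not come from the LAM sources, and treating a message of exactly the crossover length as short is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; not taken from the LAM sources. */
enum send_protocol { PROTO_SHORT, PROTO_LONG };

/*
 * Pick the protocol for a message of `len` bytes, given the transport's
 * crossover point. Assumption: a message exactly at the crossover is short.
 */
enum send_protocol choose_protocol(size_t len, size_t short_msg_len)
{
    assert(short_msg_len > 0);
    return (len <= short_msg_len) ? PROTO_SHORT : PROTO_LONG;
}

/*
 * Number of phases on the sending side:
 *   short: 1 (header and data go out together)
 *   long:  2 (header first, rest of the data after the acknowledgment)
 */
int sender_phases(size_t len, size_t short_msg_len)
{
    return choose_protocol(len, short_msg_len) == PROTO_SHORT ? 1 : 2;
}
```

With a 64 KB crossover, a 1 KB message takes the short path, while a 1 MB message takes the long path and its second phase is held back until the receiver has posted a matching receive.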

The crossover point from short to long messages is configurable in each transport. See the transport-specific sections below (tcp, usysv and sysv) for further information.

Shortcircuit send/receive

Typically when a message is sent or received LAM MPI creates a request structure, fills it with information about the message, links the request into a list of messages and calls into a progression "engine" to effect the data transfer.

When there are no active requests and a blocking (standard mode) send or receive is done, the overhead of creating the request and linking it into the list can be bypassed (shortcircuited) and the progression "engine" called directly to effect the transfer.

This optimization has not been tested as thoroughly as we would like, so a config option is provided to disable it. If you suspect that the optimization may be causing problems, you can disable it with the --without-shortcircuit option to the configure script.
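
The conditions under which the shortcircuit path applies can be written down as a small predicate. The sketch below is a hedged illustration only: struct progress_state and can_shortcircuit are hypothetical names, and LAM's real request bookkeeping is more involved than a single counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state for illustration; not LAM's actual data structures. */
struct progress_state {
    int  active_requests;       /* requests currently linked into the list */
    bool shortcircuit_enabled;  /* false if built with --without-shortcircuit */
};

/*
 * True when the request setup can be bypassed: the optimization is enabled,
 * the call is a blocking standard mode send or receive, and no other
 * requests are active.
 */
bool can_shortcircuit(const struct progress_state *st,
                      bool blocking_standard_mode)
{
    assert(st != NULL);
    assert(st->active_requests >= 0);
    return st->shortcircuit_enabled
        && blocking_standard_mode
        && st->active_requests == 0;
}
```

Any outstanding request, or any non-blocking or non-standard mode call, sends the operation down the normal path through the request list.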

Tcp transport

The crossover point from short to long message is configurable via the constant TCPSHORTMSGLEN in rpi.tcp.h. It can also be set from the configure script via the --with-tcp-short option. The default is 64 KB.

Usysv and sysv transports

Configuration constants for the usysv and sysv transports are found in rpi.shm.h.

In these transports processes on different nodes communicate via TCP sockets. The crossover point from short to long messages for these communications is configurable via the constant TCPSHORTMSGLEN. It can also be set from the configure script via the --with-tcp-short option. The default is 64 KB.

Processes located on the same node communicate via shared memory. The transport allocates one SYSV shared memory segment, which is shared by all processes in the task that are on the node. This segment is logically divided into two areas.

The postbox area contains postboxes for short message communication. A postbox is used for one-way communication between two processes. The space allocated per postbox is SHMSHORTMSGLEN + CACHELINESIZE bytes. SHMSHORTMSGLEN is configurable (via the configure option --with-shm-short or by editing rpi.shm.h). It is the crossover point from short to long messages in shared memory communication, and the default value is 8 KB.

CACHELINESIZE must be the size of a cache line or a multiple thereof. The default setting is 64 bytes. You shouldn't need to change it. CACHELINESIZE bytes in the postbox are used for a cache-line sized synchronization location.
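
One way to picture a single postbox is as a C structure. This is a sketch under assumptions: struct postbox and postbox_bytes are hypothetical, the order of the two fields is a guess, and the macro values are simply the documented defaults rather than values read from rpi.shm.h.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINESIZE   64          /* documented default, in bytes */
#define SHMSHORTMSGLEN  (8 * 1024)  /* documented default, 8 KB */

/* Hypothetical layout; the field order is an assumption. */
struct postbox {
    char sync[CACHELINESIZE];   /* cache-line sized synchronization location */
    char data[SHMSHORTMSGLEN];  /* room for one short message */
};

/* Check at compile time that the layout matches the documented size. */
static_assert(sizeof(struct postbox) == SHMSHORTMSGLEN + CACHELINESIZE,
              "postbox must occupy SHMSHORTMSGLEN + CACHELINESIZE bytes");

/* Bytes occupied by one postbox. */
size_t postbox_bytes(void)
{
    return sizeof(struct postbox);
}
```

With the defaults, one postbox occupies 8256 bytes. Giving the synchronization location a full cache line of its own keeps it from sharing a line with message data.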

The size of the postbox area is np * (np-1) * (SHMSHORTMSGLEN + CACHELINESIZE) bytes, that is, one postbox for each ordered pair of the np processes sharing the segment.

The rest of the shared memory area is used as a global pool from which space for long message transfers is allocated. Allocation from this pool is locked. The default lock mechanism is a SYSV semaphore, but the configure option --with-pthread-lock can be used to change this to a process-shared pthread mutex. The size of this pool is configurable via the constant LAM_MPI_SHMPOOLSIZE and by the configure option --with-shm-poolsize.

The configure script will try to determine a size for the pool if none is explicitly specified. You should always check the value it chooses to see if it is reasonable. Larger values should improve performance, especially when an application passes large messages, but will also increase the system resources used by each task.

The total size of the shared segment allocated is 2 * CACHELINESIZE + LAM_MPI_SHMPOOLSIZE + np * (np-1) * (SHMSHORTMSGLEN + CACHELINESIZE) bytes. The 2 * CACHELINESIZE bytes are for the global pool lock.
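
The sizing formulas are easy to evaluate when planning a configuration. The following is a minimal sketch: the function names are hypothetical, and np is assumed to be the number of processes on the node that share the segment.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in the postbox area: np * (np-1) postboxes of equal size. */
size_t postbox_area_bytes(size_t np, size_t shm_short_len, size_t cache_line)
{
    assert(np >= 1);
    return np * (np - 1) * (shm_short_len + cache_line);
}

/* Total bytes in the segment: pool lock + global pool + postbox area. */
size_t shared_segment_bytes(size_t np, size_t shm_short_len,
                            size_t cache_line, size_t pool_size)
{
    return 2 * cache_line
         + pool_size
         + postbox_area_bytes(np, shm_short_len, cache_line);
}
```

For example, with 4 processes on a node, the default SHMSHORTMSGLEN of 8 KB and CACHELINESIZE of 64 bytes, the postbox area is 99072 bytes. Adding a 1 MB pool (an illustrative figure, not a LAM default) gives a segment of 1147776 bytes. Because the postbox area grows with np * (np-1), it can come to dominate the segment on nodes running many processes.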

Use of the global pool
When a message larger than 2 * SHMSHORTMSGLEN is sent, the transport sends SHMSHORTMSGLEN bytes with the first packet and, when the acknowledgment is received, allocates (message length - SHMSHORTMSGLEN) bytes from the global pool to transfer the rest of the message.

To prevent a single large message transfer from monopolizing the global pool, allocations from it are actually restricted to a maximum of LAM_MPI_SHMMAXALLOC bytes. Even with this restriction, it is possible for the global pool to become temporarily exhausted. In this case the transport will fall back to using the postbox area to transfer the message. Performance will be degraded, but the application will progress.
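
The allocation rule and the fallback can be sketched as two small functions. Everything here is hypothetical naming, and the exhaustion test (the request does not fit in the free space) is a simplifying assumption; the real allocator may fail for other reasons, such as fragmentation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the two ways the rest of a message can travel. */
enum long_path { VIA_POOL, VIA_POSTBOX };

/*
 * Bytes requested from the global pool for a message longer than
 * 2 * SHMSHORTMSGLEN: the part not sent with the first packet,
 * capped at the maximum single allocation.
 */
size_t pool_request_bytes(size_t msg_len, size_t shm_short_len,
                          size_t max_alloc)
{
    assert(msg_len > 2 * shm_short_len);
    size_t rest = msg_len - shm_short_len;
    return rest < max_alloc ? rest : max_alloc;
}

/* Fall back to the postbox area when the pool cannot satisfy the request. */
enum long_path choose_long_path(size_t request, size_t pool_free)
{
    return request <= pool_free ? VIA_POOL : VIA_POSTBOX;
}
```

With SHMSHORTMSGLEN at 8 KB and an illustrative cap of 64 KB, a 1 MB message requests 65536 bytes from the pool, while a 20000 byte message requests only the remaining 11808 bytes.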

LAM_MPI_SHMMAXALLOC is configurable via the configure option --with-shm-maxalloc or editing rpi.shm.h.

Synchronization

The usysv and sysv transports differ only in the mechanism used to synchronize the transfer of messages via shared memory. The usysv transport uses spin locks with back-off and the sysv transport uses SYSV semaphores.

Both transports use a few SYSV semaphores for synchronizing the deallocation of shared structures or for synchronizing access to the shared pool.

The usysv transport should be superior to the sysv transport on multiprocessors. On uniprocessors, which transport is better depends on the OS and on the means used for processor yielding. On a Linux uniprocessor, for example, using semaphores (the sysv transport) appears to be vastly superior to spin-locking.

Usysv transport spin-locks
The usysv transport uses spin locks with back-off. When a process backs off, it attempts to yield the processor. If the configure script found a system-provided yield function such as yield() or sched_yield(), this is used. If no such function is found, then select() on NULL file descriptor sets with a timeout of 10 microseconds is used.
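
A minimal sketch of spinning with back-off is shown below, using the two yield mechanisms mentioned above. The function names, the fixed number of spins per round and the give-up limit are all assumptions made for illustration; LAM's actual back-off policy is not described here. The sketch uses C11 atomics for the flag.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Yield by sleeping in select() on NULL descriptor sets for 10 microseconds. */
int yield_with_select(void)
{
    struct timeval tv;
    tv.tv_sec = 0;
    tv.tv_usec = 10;
    return select(0, NULL, NULL, NULL, &tv);
}

/* Yield with sched_yield(), or with select() when that is forced. */
int yield_processor(int use_select)
{
    return use_select ? yield_with_select() : sched_yield();
}

/*
 * Spin until *flag becomes nonzero. After each round of busy polling the
 * caller backs off by yielding the processor. Returns 1 if the flag was
 * seen set, 0 if max_rounds rounds passed without it being set.
 */
int spin_until_set(atomic_int *flag, unsigned spins_per_round,
                   unsigned max_rounds, int use_select)
{
    assert(flag != NULL);
    for (unsigned round = 0; round < max_rounds; ++round) {
        for (unsigned i = 0; i < spins_per_round; ++i) {
            if (atomic_load(flag) != 0)
                return 1;
        }
        (void)yield_processor(use_select);  /* back off */
    }
    return atomic_load(flag) != 0;
}
```

On a uniprocessor the yield is what lets the other process run and set the flag, which is why the choice of yield mechanism matters so much there.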

The use of select to yield can be forced by the --with-select-yield option to the configure script.

Sysv transport semaphores
The sysv transport allocates a semaphore set (of size 6) for each process pair communicating via shared memory. On some systems you may need to reconfigure the system to allow for more semaphore sets if running tasks with many processes communicating via shared memory.
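
To judge whether system limits need raising, the number of semaphore sets can be estimated in advance. The sketch below assumes that "process pair" means an unordered pair and that every pair of processes on the node communicates; the function names are hypothetical. It does not count the few additional semaphores both transports use for the shared pool and for deallocation.

```c
#include <assert.h>
#include <stddef.h>

#define SEMS_PER_SET 6  /* each per-pair semaphore set has 6 semaphores */

/* Semaphore sets needed if every unordered pair on the node communicates. */
size_t sysv_pair_sets(size_t np_on_node)
{
    assert(np_on_node >= 1);
    return np_on_node * (np_on_node - 1) / 2;
}

/* Total semaphores held in those sets. */
size_t sysv_pair_semaphores(size_t np_on_node)
{
    return SEMS_PER_SET * sysv_pair_sets(np_on_node);
}
```

Under these assumptions, 16 processes on one node need 120 sets holding 720 semaphores. These figures can be compared with the system-wide limits on semaphore sets and semaphores (named SEMMNI and SEMMNS on many System V style kernels).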