High-performance Communication

High-performance communication is critical for applications running on supercomputers. The traditional MPI-everywhere model (typically one process per core) does not scale well on modern architectures. Hybrid MPI+threads (typically one process per node or socket and one thread per core) is gaining popularity because it utilizes the many cores on a node while sharing the remaining on-node resources. However, the communication performance of MPI+threads is poor, especially when multiple threads communicate concurrently. This poor performance stems from an outdated view, held by both MPI users and MPI developers, of the network as a single device.
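To make the hybrid model concrete, the sketch below shows one MPI process whose OpenMP threads all make MPI calls under MPI_THREAD_MULTIPLE. It is a minimal illustration of the programming model only; the ring exchange, per-thread tags, and message sizes are illustrative choices, not part of this project.

    /* Minimal sketch of the MPI+threads model: one MPI process per node
     * (or socket) and one OpenMP thread per core, with every thread free
     * to make MPI calls under MPI_THREAD_MULTIPLE. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, size;

        /* Request full thread support; the library reports what it grants. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int next = (rank + 1) % size;          /* ring neighbors */
        int prev = (rank - 1 + size) % size;

        /* Each thread exchanges its own message around the ring. */
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int sendbuf = rank * 100 + tid, recvbuf = -1;

            /* Tag by thread id so messages from different threads
             * on the same process do not match each other. */
            MPI_Sendrecv(&sendbuf, 1, MPI_INT, next, tid,
                         &recvbuf, 1, MPI_INT, prev, tid,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }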

A trend that is often overlooked is the growing network parallelism available on a supercomputer node. Modern network interface cards (NICs), such as Mellanox InfiniBand and Intel Omni-Path, feature multiple network hardware contexts that serve as parallel interfaces to the network from a single node. The state of the art, however, is stuck in a chicken-and-egg situation. Applications typically do not expose communication parallelism to MPI because today's MPI libraries cannot exploit it; MPI libraries, in turn, retain conservative designs, such as a global critical section and the use of only one network hardware context, because applications do not expose any parallelism for the library to exploit.
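For illustration only (a commonly discussed technique, not necessarily this project's design), the sketch below shows one way an application could expose communication parallelism: giving each thread its own duplicated communicator, so that independent streams of traffic are visible to a parallelism-aware MPI library, which could then map them onto separate network hardware contexts.

    /* Illustrative sketch: per-thread communicators expose independent
     * communication streams to the MPI library. Assumes full thread
     * support was granted and that every process runs the same number
     * of OpenMP threads. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int nthreads = omp_get_max_threads();

        /* One duplicated communicator per thread. Traffic on different
         * communicators is independent, which a parallelism-aware MPI
         * library could route through different hardware contexts. */
        MPI_Comm *comms = malloc(nthreads * sizeof(MPI_Comm));
        for (int i = 0; i < nthreads; i++)
            MPI_Comm_dup(MPI_COMM_WORLD, &comms[i]);

        int next = (rank + 1) % size;
        int prev = (rank - 1 + size) % size;

        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int sendbuf = tid, recvbuf = -1;
            MPI_Request reqs[2];

            /* Each thread drives only its own communicator. */
            MPI_Irecv(&recvbuf, 1, MPI_INT, prev, 0, comms[tid], &reqs[0]);
            MPI_Isend(&sendbuf, 1, MPI_INT, next, 0, comms[tid], &reqs[1]);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        }

        for (int i = 0; i < nthreads; i++)
            MPI_Comm_free(&comms[i]);
        free(comms);
        MPI_Finalize();
        return 0;
    }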

This project investigates the above problem and develops a fast MPI+threads library for high-performance inter-node communication.

This project is partially funded by Argonne National Laboratory via a DOE sub-contract.