Tpetra::DistObject: Evaluate host-pinned buffers for CUDA + MPI
Created by: mhoemmen
Tpetra::DistObject currently offers two options for CUDA + MPI:
- Give CudaSpace buffers to MPI
- Give HostSpace buffers to MPI
(DistObject lets subclasses pack and unpack wherever they like, as long as they update the sync state correctly.)
We want to evaluate a third option: Give CudaHostPinnedSpace buffers to MPI. Host-pinned memory behaves like CudaUVMSpace, in the sense that both host and device can access it. However, MPI sees host-pinned memory as host memory. This means two things:
- MPI need not be CUDA aware in order to access host-pinned memory, yet we can still pack and unpack on device.
- We don't have to worry about CUDA-aware MPI being slow.
The latter is important, since we've observed some "CUDA-aware MPI" implementations being slow in practice.
Host-pinned memory has a high allocation cost, so we should avoid reallocating buffers on every communication. It may make sense to start with the static View allocation functions in PR #4734. However, we can't use those as they are, because each DistObject instance will need its own pack and unpack buffers. This may call for a simple memory pool, and for changes to DistObject and/or subclasses so that they only hold buffer allocations while communication is active.
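To make the memory pool idea concrete, here is a minimal sketch of one possible shape for it. It is not Tpetra code and not the design from PR #4734. The names `PinnedBufferPool`, `Lease`, `pinned_alloc`, `pinned_free`, and `pool_demo` are all hypothetical, and `std::malloc` stands in for the real pinned allocation (e.g. `cudaMallocHost`, or a Kokkos::View in CudaHostPinnedSpace) so that the sketch compiles without CUDA. The sketch is not thread safe.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <new>

// Hypothetical allocator hooks. In a CUDA build these would wrap
// cudaMallocHost / cudaFreeHost; std::malloc is only a stand-in here.
inline void* pinned_alloc(std::size_t bytes) { return std::malloc(bytes); }
inline void pinned_free(void* p) { std::free(p); }

// Hypothetical pool: hands out buffers for the duration of one
// communication and keeps them for reuse afterwards.
class PinnedBufferPool {
public:
  // RAII handle. The buffer goes back to the pool when the Lease is
  // destroyed. A Lease must not outlive the pool that issued it.
  class Lease {
  public:
    Lease(Lease&& o) noexcept : pool_(o.pool_), ptr_(o.ptr_), cap_(o.cap_) {
      o.ptr_ = nullptr;  // moved-from Lease releases nothing
    }
    Lease(const Lease&) = delete;
    Lease& operator=(const Lease&) = delete;
    Lease& operator=(Lease&&) = delete;
    ~Lease() { if (ptr_ != nullptr) pool_->release(ptr_, cap_); }

    void* data() const { return ptr_; }
    std::size_t capacity() const { return cap_; }

  private:
    friend class PinnedBufferPool;
    Lease(PinnedBufferPool* pool, void* ptr, std::size_t cap)
      : pool_(pool), ptr_(ptr), cap_(cap) {}
    PinnedBufferPool* pool_;
    void* ptr_;
    std::size_t cap_;
  };

  PinnedBufferPool() = default;
  PinnedBufferPool(const PinnedBufferPool&) = delete;
  PinnedBufferPool& operator=(const PinnedBufferPool&) = delete;
  ~PinnedBufferPool() {
    for (auto& entry : free_) pinned_free(entry.second);
  }

  // Reuse the smallest idle buffer that fits; allocate only on a miss.
  Lease acquire(std::size_t bytes) {
    if (bytes == 0) bytes = 1;  // keep every leased pointer non-null
    auto it = free_.lower_bound(bytes);
    if (it != free_.end()) {
      const std::size_t cap = it->first;
      void* p = it->second;
      free_.erase(it);
      return Lease(this, p, cap);
    }
    void* p = pinned_alloc(bytes);
    if (p == nullptr) throw std::bad_alloc();
    ++num_allocations_;
    return Lease(this, p, bytes);
  }

  // Number of times the (expensive) underlying allocator was called.
  std::size_t num_allocations() const { return num_allocations_; }

private:
  void release(void* p, std::size_t cap) { free_.emplace(cap, p); }

  std::multimap<std::size_t, void*> free_;  // idle buffers, keyed by capacity
  std::size_t num_allocations_ = 0;
};

// Usage: three communication rounds share a single underlying allocation.
inline std::size_t pool_demo() {
  PinnedBufferPool pool;
  for (int round = 0; round < 3; ++round) {
    PinnedBufferPool::Lease buf = pool.acquire(1024);
    // Pack into buf.data(), hand it to MPI, wait, then unpack.
    (void) buf;
  }  // buf returns to the pool at the end of each round
  return pool.num_allocations();
}
```

Tying the buffer to a scoped lease matches the goal above of holding allocations only while communication is active: a DistObject would acquire its pack and unpack buffers when communication begins and drop them when it ends, so idle instances hold no pinned memory.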
Definition of Done
- Write a Tpetra test that prototypes use of CudaHostPinnedSpace for communication.
- Write a benchmark to compare performance of point-to-point communication with CudaHostPinnedSpace vs. CudaSpace communication buffers.
- If CudaHostPinnedSpace pays off, change DistObject to use it for communication buffers.
- Evaluate performance with a Tpetra benchmark.
- If performance is good, deploy solution in Tpetra.
- Is blocked by #4734