Tpetra::DistObject: Evaluate host-pinned buffers for CUDA + MPI
Created by: mhoemmen
Tpetra::DistObject currently offers two options for CUDA + MPI:
- Give CudaSpace buffers to MPI
- Give HostSpace buffers to MPI
(DistObject lets subclasses pack and unpack wherever they like, as long as they update the sync state correctly.)
We want to evaluate a third option: Give CudaHostPinnedSpace buffers to MPI. Host-pinned memory behaves like CudaUVMSpace, in the sense that both host and device can access it. However, MPI sees host-pinned memory as host memory. This means two things:
- MPI need not be CUDA aware in order to access host-pinned memory, yet we can still pack and unpack on device.
- We don't have to worry about CUDA-aware MPI being slow.
The latter is important, since we have observed some "CUDA-aware MPI" implementations that are slow in practice.
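A minimal sketch of the idea (not DistObject's actual pack path; `sendPacked`, `destRank`, and `tag` are made up for illustration): a device kernel packs into a host-pinned View, then a plain `MPI_Send` ships the buffer without requiring any CUDA awareness from MPI.

```c++
#include <Kokkos_Core.hpp>
#include <mpi.h>

// Pack a device array into a host-pinned buffer on device, then send it
// with ordinary (non-CUDA-aware) MPI.
void sendPacked (const Kokkos::View<const double*, Kokkos::CudaSpace>& src,
                 const int destRank, const int tag, MPI_Comm comm)
{
  const int n = static_cast<int> (src.extent (0));
  // CudaHostPinnedSpace: device kernels may write it, yet MPI sees host memory.
  Kokkos::View<double*, Kokkos::CudaHostPinnedSpace> buf ("sendBuf", n);

  // "Pack" on device (here just a copy; real packing would permute entries).
  Kokkos::parallel_for ("pack", Kokkos::RangePolicy<Kokkos::Cuda> (0, n),
    KOKKOS_LAMBDA (const int i) { buf(i) = src(i); });
  Kokkos::fence (); // make sure the pack finishes before MPI reads the buffer

  MPI_Send (buf.data (), n, MPI_DOUBLE, destRank, tag, comm);
}
```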
Host-pinned memory has a high allocation cost. It may make sense to start with the static View allocation functions in PR #4734. We can't use those without further work, because each DistObject instance will need its own pack and unpack buffers. This may call for a simple memory pool, and for changes to DistObject and/or subclasses so that they only hold buffer allocations while communication is active.
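A trivial pool could look something like the following sketch (`PinnedBufferPool` is hypothetical, not a proposed Tpetra class): callers return buffers once communication completes, so the pinning cost is paid once per size class rather than on every transfer.

```c++
#include <Kokkos_Core.hpp>
#include <map>

// Hypothetical minimal pool of host-pinned buffers, keyed by size.
class PinnedBufferPool {
public:
  using buffer_type = Kokkos::View<char*, Kokkos::CudaHostPinnedSpace>;

  // Return a buffer of at least nbytes, reusing a freed allocation if one fits.
  buffer_type get (const size_t nbytes) {
    auto it = free_.lower_bound (nbytes); // smallest free buffer >= nbytes
    if (it != free_.end ()) {
      buffer_type buf = it->second;
      free_.erase (it);
      return buf;
    }
    // No fit; pay the (expensive) pinned allocation cost once.
    return buffer_type (Kokkos::view_alloc ("pinnedBuf",
                                            Kokkos::WithoutInitializing),
                        nbytes);
  }

  // Give the buffer back after communication completes (e.g., after MPI_Wait).
  void release (buffer_type buf) {
    free_.insert ({buf.extent (0), buf});
  }

private:
  std::multimap<size_t, buffer_type> free_;
};
```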
Definition of Done
- Write a Tpetra test that prototypes use of CudaHostPinnedSpace for communication buffers.
- Write a benchmark that compares point-to-point communication performance with CudaHostPinnedSpace vs. CudaSpace communication buffers (see the sketch after this list).
- If CudaHostPinnedSpace pays off, change DistObject to use it for communication buffers.
- Evaluate performance with a Tpetra benchmark.
- If performance is good, deploy the solution in Tpetra.
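A rough shape for that point-to-point benchmark might be the following ping-pong sketch (all names, message sizes, and repetition counts are placeholders; the CudaSpace case assumes a CUDA-aware MPI):

```c++
#include <Kokkos_Core.hpp>
#include <mpi.h>
#include <cstdio>

// Time one round trip of an n-entry message between ranks 0 and 1,
// using a buffer allocated in the given Kokkos memory space.
template<class MemorySpace>
double pingPong (const int n, const int reps, const int rank, MPI_Comm comm)
{
  Kokkos::View<double*, MemorySpace> buf ("buf", n);
  MPI_Barrier (comm);
  const double t0 = MPI_Wtime ();
  for (int r = 0; r < reps; ++r) {
    if (rank == 0) {
      MPI_Send (buf.data (), n, MPI_DOUBLE, 1, 0, comm);
      MPI_Recv (buf.data (), n, MPI_DOUBLE, 1, 0, comm, MPI_STATUS_IGNORE);
    }
    else if (rank == 1) {
      MPI_Recv (buf.data (), n, MPI_DOUBLE, 0, 0, comm, MPI_STATUS_IGNORE);
      MPI_Send (buf.data (), n, MPI_DOUBLE, 0, 0, comm);
    }
  }
  return (MPI_Wtime () - t0) / reps;
}

int main (int argc, char* argv[])
{
  MPI_Init (&argc, &argv); // run with at least 2 ranks
  Kokkos::initialize (argc, argv);
  {
    int rank;
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    const int n = 1 << 20;
    const int reps = 100;
    const double tPinned =
      pingPong<Kokkos::CudaHostPinnedSpace> (n, reps, rank, MPI_COMM_WORLD);
    // CudaSpace buffers require CUDA-aware MPI.
    const double tDevice =
      pingPong<Kokkos::CudaSpace> (n, reps, rank, MPI_COMM_WORLD);
    if (rank == 0) {
      printf ("pinned: %g s/iter, device: %g s/iter\n", tPinned, tDevice);
    }
  }
  Kokkos::finalize ();
  MPI_Finalize ();
  return 0;
}
```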
- Is blocked by #4734