Tpetra::DistObject: Evaluate host-pinned buffers for CUDA + MPI

Created by: mhoemmen

@trilinos/tpetra

Tpetra::DistObject currently offers two options for CUDA + MPI:

Give CudaSpace buffers to MPI
Give HostSpace buffers to MPI

(DistObject lets subclasses pack and unpack wherever they like, as long as they update the sync state correctly.)

We want to evaluate a third option: Give CudaHostPinnedSpace buffers to MPI. Host-pinned memory behaves like CudaUVMSpace, in the sense that both host and device can access it. However, MPI sees host-pinned memory as host memory. This means two things:

MPI need not be CUDA aware in order to access host-pinned memory, yet we can still pack and unpack on device.
We don't have to worry about CUDA-aware MPI being slow.

The latter is important, since we've observed some "CUDA-aware MPI" implementations being slow in practice.

Host-pinned memory has a high allocation cost, so buffers should be allocated rarely and reused. It may make sense to start with the static View allocation functions in PR #4734. We can't use those without further work, because each DistObject instance needs its own pack and unpack buffers. This may call for a simple memory pool, and for changes to DistObject and/or its subclasses so that they only hold buffer allocations while communication is active.
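
To make the memory pool idea concrete, here is a minimal sketch of what "hold a buffer only while communication is active" could look like. It is not existing Tpetra code. The names BufferPool, Lease, acquire, numAllocations, and numFree are hypothetical, and std::vector<char> stands in for a host-pinned allocation (a real version would presumably wrap a Kokkos::View in CudaHostPinnedSpace) so that the sketch compiles without CUDA or Kokkos. It is also not thread safe.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// ASSUMPTION: stand-in for an expensive host-pinned allocation.
// A real pool would hold host-pinned Views instead of std::vector<char>.
using Buffer = std::vector<char>;

// Hypothetical pool that reuses buffers instead of reallocating them.
class BufferPool {
public:
  // RAII handle: the buffer goes back to the pool when the Lease is destroyed,
  // so a DistObject would hold it only for the duration of one communication.
  class Lease {
  public:
    Lease(BufferPool& pool, std::unique_ptr<Buffer> buf)
      : pool_(&pool), buf_(std::move(buf)) {}
    Lease(Lease&&) = default;
    Lease(const Lease&) = delete;
    ~Lease() {
      if (buf_) {  // a moved-from Lease owns nothing
        pool_->release(std::move(buf_));
      }
    }
    Buffer& get() { return *buf_; }

  private:
    BufferPool* pool_;
    std::unique_ptr<Buffer> buf_;
  };

  // Hand out a buffer of numBytes bytes, reusing a free one if it is big enough.
  Lease acquire(std::size_t numBytes) {
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if ((*it)->capacity() >= numBytes) {
        std::unique_ptr<Buffer> buf = std::move(*it);
        free_.erase(it);
        buf->resize(numBytes);  // no reallocation: capacity already suffices
        return Lease(*this, std::move(buf));
      }
    }
    ++numAllocations_;  // this is the expensive path we want to take rarely
    return Lease(*this, std::make_unique<Buffer>(numBytes));
  }

  std::size_t numAllocations() const { return numAllocations_; }
  std::size_t numFree() const { return free_.size(); }

private:
  void release(std::unique_ptr<Buffer> buf) { free_.push_back(std::move(buf)); }

  std::vector<std::unique_ptr<Buffer>> free_;
  std::size_t numAllocations_ = 0;
};
```

In this sketch, a DistObject would call acquire at the start of a communication, pack into the leased buffer, hand it to MPI, and let the Lease go out of scope once the communication completes. The pool must outlive every Lease it hands out. Whether first-fit reuse is the right policy, and how it interacts with the static allocations in PR #4734, is something the prototype would have to determine.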

Definition of Done

Write a Tpetra test that prototypes use of CudaHostPinnedSpace for communication.
Write a benchmark to compare performance of point-to-point communication with CudaHostPinnedSpace vs. CudaSpace communication buffers.
If CudaHostPinnedSpace pays off, change DistObject to use it for communication buffers.
Evaluate the performance of the changed DistObject with a Tpetra benchmark.
If performance is good, deploy solution in Tpetra.

Related Issues

Is blocked by #4734