Tpetra::MultiVector: Don't use CudaUVMSpace for internal comm buffers
Created by: mhoemmen
@trilinos/tpetra
Blocked by: #1571 (closed), #1602, #1658 (closed)
On Pascal GPUs (but not on K80), MPI appears to be slow when accessing UVM-allocated device buffers, though not standard CUDA device buffers. This causes trouble for Tpetra::Distributor when sending and receiving packed data, as measured by @ambrad. It likely also causes trouble for dot products and norms.
@crtrott and I wrote an MPI + Kokkos benchmark to measure this independently of Tpetra.
Possible work-around: don't use CudaUVMSpace for internal comm buffers in the DistObject methods packAndPrepareNew and unpackAndCombineNew. Define a typedef pack_dev_memory_space that is CudaSpace when execution_space is Cuda; see the sketch below.
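A minimal sketch of what that typedef might look like, assuming a trait templated on the execution space. The PackMemorySpace helper and the usage comment are hypothetical names for illustration; only the selection logic matters: pick plain CudaSpace (non-UVM) when the execution space is Cuda, and fall back to the execution space's default memory space otherwise.

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical helper trait: maps an execution space to the memory space
// that pack/unpack comm buffers should use.
template <class ExecutionSpace>
struct PackMemorySpace {
  // Default: use the execution space's own default memory space.
  using type = typename ExecutionSpace::memory_space;
};

#ifdef KOKKOS_ENABLE_CUDA
template <>
struct PackMemorySpace<Kokkos::Cuda> {
  // Avoid CudaUVMSpace for comm buffers; use plain CUDA device memory,
  // even if Kokkos was configured with UVM as Cuda's default memory space.
  using type = Kokkos::CudaSpace;
};
#endif

template <class ExecutionSpace>
using pack_dev_memory_space = typename PackMemorySpace<ExecutionSpace>::type;

// Hypothetical usage: allocate the pack buffer in the non-UVM memory space.
// Kokkos::View<packet_type*, pack_dev_memory_space<execution_space>>
//   exports ("exports", numExportPackets);
```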