Tpetra::MultiVector: Don't use CudaUVMSpace for internal comm buffers
Created by: mhoemmen
@trilinos/tpetra
Blocked by: #1571 (closed), #1602, #1658 (closed)
On Pascal GPUs (but not on K80), MPI appears to be slow when accessing UVM-allocated device buffers, though not standard CUDA device buffers. This causes trouble for Tpetra::Distributor when sending and receiving packed data, as measured by @ambrad. It likely also causes trouble for dot products and norms.
@crtrott and I wrote an MPI + Kokkos benchmark to measure this independently of Tpetra.
Possible work-around: don't use CudaUVMSpace for internal comm buffers in the DistObject methods packAndPrepareNew and unpackAndCombineNew. Define a typedef pack_dev_memory_space that is CudaSpace when execution_space is Cuda; see the sketch below.
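A minimal sketch of what that typedef might look like, assuming a trait templated on the execution space. The PackMemorySpace helper and the usage comment are hypothetical names for illustration; only the selection logic matters: pick plain CudaSpace (non-UVM) when the execution space is Cuda, and fall back to the execution space's default memory space otherwise.

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical helper trait: maps an execution space to the memory space
// that pack/unpack comm buffers should use.
template <class ExecutionSpace>
struct PackMemorySpace {
  // Default: use the execution space's own default memory space.
  using type = typename ExecutionSpace::memory_space;
};

#ifdef KOKKOS_ENABLE_CUDA
template <>
struct PackMemorySpace<Kokkos::Cuda> {
  // Avoid CudaUVMSpace for comm buffers; use plain CUDA device memory,
  // even if Kokkos was configured with UVM as Cuda's default memory space.
  using type = Kokkos::CudaSpace;
};
#endif

template <class ExecutionSpace>
using pack_dev_memory_space = typename PackMemorySpace<ExecutionSpace>::type;

// Hypothetical usage: allocate the pack buffer in the non-UVM memory space.
// Kokkos::View<packet_type*, pack_dev_memory_space<execution_space>>
//   exports ("exports", numExportPackets);
```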