Tpetra::MultiVector: Unpack kernels not using correct execution space
Created by: mhoemmen
@trilinos/tpetra @trilinos/stokhos Blocks: #1088 (closed)
Tpetra::MultiVector::unpackAndCombineNew uses Kokkos parallel kernels to unpack remote data into the target MultiVector. The kernels are supposed to run on either device or host, depending on whether the target MultiVector is sync'd to device or host. However, the kernels were only ever running on device. This caused trouble with #1088 (closed) (see that issue), and may hurt performance for small MultiVectors, or in other cases where users prefer to work on host.
The issue is that the kernels were using the output Kokkos::View (of the target MultiVector's data) to determine the execution space on which to run. The problem with this is that for a Kokkos::DualView of CudaUVMSpace, the host and device Views are the same. This blocks #1088 (closed), whose fix requires that the input View (the buffer to unpack) be either a CudaSpace or a HostSpace View. Furthermore, CudaUVMSpace::execution_space is Cuda, so the kernel always runs on device, even when given a HostSpace buffer to unpack.
The fix is to change these kernels to take a Kokkos execution space argument, which lets the caller specify the execution space on which to run. Since it is an execution space instance (not just a type), this also leaves a future option to run in a separate CUDA stream (e.g., to overlap communication and computation), or on a subset of threads.
This fix also requires updating Stokhos' specializations of the unpack kernels.