Tpetra::MultiVector::reduce broken if getStride() > getLocalLength()

Created by: mhoemmen

@trilinos/tpetra @trilinos/belos @cgcgcg

Current Behavior

If a Tpetra::MultiVector has getStride() > getLocalLength(), then reduce() gives incorrect results.

Motivation and Context

I discovered this while working on a fix for #4626 (closed), a Belos performance issue on GPUs. My original attempted fix created MultiVectors from DualViews with stride(1) > extent(0), and some Belos tests then failed. Evidently no Tpetra test exercises reduce() on a MultiVector with this property.

Possible Solution

I have a fix ready.

Steps to Reproduce

Create a Kokkos::DualView dv_orig with M + S rows and N columns, where M, S, and N are positive integers.
Take a subview of its first M rows: auto dv = Kokkos::subview (dv_orig, std::pair<size_t, size_t> (0, M), Kokkos::ALL ());
Create a Tpetra::MultiVector from dv and a locally replicated Map (M rows per process, over MPI_COMM_WORLD).
Call reduce() on the MultiVector. The results are wrong, even in a non-CUDA build.
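
This report does not state the root cause. As an illustration only, here is a plain C++ sketch (no Kokkos or Tpetra) of one way a strided layout can produce wrong results: the data is column-major with a column stride of M + S, and the code reading it assumes the M x N block is packed contiguously. The function names are hypothetical and are not part of Tpetra; whether reduce() fails this way is an assumption, not something this report confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build a column-major parent buffer with (M + S) rows and N columns.
// The first M rows of each column hold 1.0; the S extra rows hold -100.0
// so that any accidental read of them is obvious.
std::vector<double> makeParent(std::size_t M, std::size_t S, std::size_t N) {
  const std::size_t stride = M + S;
  std::vector<double> buf(stride * N, -100.0);
  for (std::size_t j = 0; j < N; ++j) {
    for (std::size_t i = 0; i < M; ++i) {
      buf[j * stride + i] = 1.0;
    }
  }
  return buf;
}

// Correct access: entry (i, j) of the M x N block lives at j * stride + i.
double sumStrided(const std::vector<double>& buf, std::size_t M,
                  std::size_t N, std::size_t stride) {
  assert(stride >= M);
  double sum = 0.0;
  for (std::size_t j = 0; j < N; ++j) {
    for (std::size_t i = 0; i < M; ++i) {
      sum += buf[j * stride + i];
    }
  }
  return sum;
}

// Hypothetical buggy access: treats the first M * N entries as the packed
// block, which is only valid when stride == M.
double sumAssumingPacked(const std::vector<double>& buf, std::size_t M,
                         std::size_t N) {
  double sum = 0.0;
  for (std::size_t k = 0; k < M * N; ++k) {
    sum += buf[k];
  }
  return sum;
}
```

With M = 3, S = 2, N = 2, the strided sum only touches the six entries set to 1.0, while the packed-assumption sum also picks up entries from the S extra rows of the first column and misses part of the second column.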

Related Issues

Related to #4626 (closed), #4633 (closed)