Tpetra::Distributor: Fix "slow path" so we can use MPI_Isend

Created by: mhoemmen

@trilinos/tpetra @jjellio @csiefer2

Fix the "slow path" of Distributor::doPosts so that it can use nonblocking sends (MPI_Isend). The slow path kicks in when the data to send are not already grouped in contiguous chunks per process. It permutes the data into chunks that are contiguous by target process rank before sending. Currently, the slow path reuses the same send buffer for all the messages. MPI requires that a send buffer not be modified until its nonblocking send completes, so reusing one buffer for every message rules out nonblocking sends.
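One possible shape for the fix, as a minimal sketch and not the actual Distributor code: pack every message into its own disjoint region of one larger buffer before posting any send, so that no region is overwritten while a send is still in flight. The names SendPlan, PackedSends, and packByTarget are hypothetical and exist only for this illustration. The sketch assumes equal-size packets (the "three-argument" case) and double entries, and it leaves out the MPI calls themselves.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one outgoing message: the target rank and
// the positions (in units of packets) of its data in the export array.
struct SendPlan {
  int targetRank;
  std::vector<std::size_t> packetIndices;
};

// All messages packed back to back. Message i occupies the half-open range
// [regionStart[i], regionStart[i+1]) of buffer.
struct PackedSends {
  std::vector<double> buffer;
  std::vector<std::size_t> regionStart; // size = number of messages + 1
};

// Permute the export data into contiguous-by-target chunks. Each target
// gets its own region, so no two messages share buffer storage.
PackedSends packByTarget(const std::vector<double>& exports,
                         const std::vector<SendPlan>& plans,
                         std::size_t numPackets)
{
  PackedSends out;

  // First pass: compute where each message's region begins.
  out.regionStart.push_back(0);
  for (const SendPlan& p : plans) {
    out.regionStart.push_back(out.regionStart.back() +
                              p.packetIndices.size() * numPackets);
  }

  // Allocate once, so that pointers into the buffer stay valid afterwards.
  out.buffer.resize(out.regionStart.back());

  // Second pass: copy each packet into its message's region.
  for (std::size_t i = 0; i < plans.size(); ++i) {
    double* dst = out.buffer.data() + out.regionStart[i];
    for (std::size_t k : plans[i].packetIndices) {
      assert((k + 1) * numPackets <= exports.size());
      const double* src = exports.data() + k * numPackets;
      dst = std::copy(src, src + numPackets, dst);
    }
  }

  // A caller could now post one MPI_Isend per message, passing
  // out.buffer.data() + out.regionStart[i] as the send buffer, and must
  // keep out.buffer alive and unmodified until the requests complete.
  return out;
}
```

The trade-off in this sketch is memory: the packed buffer is as large as all outgoing messages combined, where the current approach only needs room for the largest single message.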

We must fix both the "three-argument" (all messages have the same size) and "four-argument" (different messages may have different sizes) overloads of doPosts, and both the Teuchos::ArrayRCP and Kokkos::View versions of each.

Motivation and Context

This is part of the overall effort to improve MPI+CUDA performance and to make the communication in Tpetra's boundary exchange and sparse matrix-vector multiply nonblocking.

Definition of Done

Fix 3-argument Teuchos::ArrayRCP overload of doPosts
Fix 3-argument Kokkos::View overload of doPosts
Fix 4-argument Teuchos::ArrayRCP overload of doPosts
Fix 4-argument Kokkos::View overload of doPosts

Related Issues

Part of #383