Tpetra::Distributor: Make doPosts nonblocking

Created by: mhoemmen

@trilinos/tpetra

Epic: #767.

Tpetra::Distributor implements the MPI communication that happens in an Export or Import. It uses MPI 2-sided point-to-point communication. Its doPosts method starts the receives and sends, and its doWaits method waits on them (MPI_Waitall). Receives are nonblocking (MPI_Irecv) and sends may be either blocking (various options, but only MPI_Send is used in practice) or nonblocking (MPI_Isend). However, sends default to blocking, and this is the only completely correct path. This is because of the so-called "slow path" of doPosts.

The "slow path" comes about when the indices in a send to a particular process aren't contiguous (i.e., are interrupted by data ~~from~~ [meant for]* other process(es)). In that case the current implementation needs an intermediate pack buffer, which it allocates on the spot. Rather than hold on to that memory until the send completes, the implementation forces blocking sends in that case (it throws std::logic_error otherwise). With a nonblocking send, the buffer could be deallocated while MPI is still reading from it.
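To make the distinction concrete, here is a minimal sketch of the contiguity test and of the gather into a pack buffer. It is not Tpetra's code: `isContiguous`, `packForProcess`, and the flat `std::vector<double>` layout are hypothetical names and assumptions chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if the indices destined for one process form a
// single increasing run with no gaps, so a send could read straight from the
// source array without an intermediate copy.
bool isContiguous(const std::vector<std::size_t>& indices) {
  for (std::size_t k = 1; k < indices.size(); ++k) {
    if (indices[k] != indices[k - 1] + 1) {
      return false;
    }
  }
  return true;
}

// Hypothetical helper: gather the entries for one process into a freshly
// allocated buffer, numPackets values per index. This mirrors the idea of the
// "slow path": noncontiguous data has to be copied before it can be sent.
std::vector<double> packForProcess(const std::vector<double>& exports,
                                   const std::vector<std::size_t>& indices,
                                   std::size_t numPackets) {
  std::vector<double> buffer;
  buffer.reserve(indices.size() * numPackets);
  for (std::size_t idx : indices) {
    const std::size_t begin = idx * numPackets;
    assert(begin + numPackets <= exports.size());
    const auto first = exports.begin() + static_cast<std::ptrdiff_t>(begin);
    const auto last = first + static_cast<std::ptrdiff_t>(numPackets);
    buffer.insert(buffer.end(), first, last);
  }
  return buffer;
}
```

In this sketch, a caller would send directly from `exports` when `isContiguous` returns true and fall back to `packForProcess` otherwise. The buffer returned by the fallback is exactly the memory whose lifetime is at issue here.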

Two fixes come to mind:

1. Keep the extra buffer. Keep it in the returned CommRequest so it doesn't get deallocated.
2. Pre-permute the data during packing (DistObject::packAndPrepare) so the slow path never gets invoked in practice.

The first fix is easier, but may be ultimately less performant.
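Both fixes can be sketched without MPI. The code below is an illustration under stated assumptions, not the proposed implementation: `PendingSend` is a hypothetical stand-in for a request handle (it is not the real CommRequest interface), and `sortByDestination` / `applyPermutation` are hypothetical helpers that assume each export entry carries a destination process rank.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

// Fix 1 (sketch): the request handle owns the pack buffer, so the buffer
// outlives the call that posted the send. A real handle would also hold the
// MPI request; that part is omitted here.
struct PendingSend {
  int destination;
  std::shared_ptr<std::vector<double>> buffer;

  // Stand-in for waiting on the send. Only after the wait is the buffer
  // released.
  void wait() { buffer.reset(); }
};

PendingSend postSend(int destination, std::vector<double> packed) {
  auto owned = std::make_shared<std::vector<double>>(std::move(packed));
  // Real code would start a nonblocking send that reads from owned->data().
  return PendingSend{destination, std::move(owned)};
}

// Fix 2 (sketch): compute a permutation that groups entries by destination
// process. The sort is stable, so entries keep their relative order within
// each destination.
std::vector<std::size_t> sortByDestination(const std::vector<int>& destOfEntry) {
  std::vector<std::size_t> perm(destOfEntry.size());
  std::iota(perm.begin(), perm.end(), std::size_t{0});
  std::stable_sort(perm.begin(), perm.end(),
                   [&destOfEntry](std::size_t a, std::size_t b) {
                     return destOfEntry[a] < destOfEntry[b];
                   });
  return perm;
}

// Apply the permutation while packing, so every destination's data ends up
// contiguous in the output.
std::vector<double> applyPermutation(const std::vector<double>& data,
                                     const std::vector<std::size_t>& perm) {
  std::vector<double> out;
  out.reserve(perm.size());
  for (std::size_t i : perm) {
    assert(i < data.size());
    out.push_back(data[i]);
  }
  return out;
}
```

The trade-off is visible even at this scale: the first approach still pays for an extra allocation and copy per noncontiguous send, while the second moves the reordering into packing, so no second copy is needed at send time.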

The "slow path" occurs in both the 3-argument (fixed # packets per index, used by Vector and MultiVector) and 4-argument (variable # packets per index, used by CrsGraph, CrsMatrix, etc.) versions of doPosts. The 3-argument version matters most for solver performance, but it's easy to do both at the same time.
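One way to see why both versions can be handled together: in either case, the location of each index's data in a flat buffer is described by an offset array. The sketch below assumes such a flat layout. `fixedOffsets` and `variableOffsets` are hypothetical helpers, not part of Tpetra.

```cpp
#include <cstddef>
#include <vector>

// Fixed number of packets per index (the 3-argument case): the data for
// index i starts at i * numPackets. The final entry is the total length.
std::vector<std::size_t> fixedOffsets(std::size_t numIndices,
                                      std::size_t numPackets) {
  std::vector<std::size_t> offsets(numIndices + 1);
  for (std::size_t i = 0; i <= numIndices; ++i) {
    offsets[i] = i * numPackets;
  }
  return offsets;
}

// Variable number of packets per index (the 4-argument case): the offsets
// are an exclusive prefix sum of the per-index counts.
std::vector<std::size_t> variableOffsets(
    const std::vector<std::size_t>& numPacketsPerIndex) {
  std::vector<std::size_t> offsets(numPacketsPerIndex.size() + 1, 0);
  for (std::size_t i = 0; i < numPacketsPerIndex.size(); ++i) {
    offsets[i + 1] = offsets[i] + numPacketsPerIndex[i];
  }
  return offsets;
}
```

Under this assumption, packing or permuting code written against an offset array would not need to know which of the two cases produced it.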

[*edit by jhux2 23-Jan-2019]