Tpetra::CrsMatrix: Overlap communication & computation in apply()

Created by: mhoemmen

@trilinos/tpetra

Epic: #767.

Overlap communication & computation in the apply() method of Tpetra::CrsMatrix, which implements sparse matrix-vector multiply.

This depends on #384 working for Tpetra::MultiVector, which in turn depends on #383.

If we fix #439, then we can fix this issue without needing to change CrsGraph semantics. In particular, CrsGraph could still compute its Import from the domain Map to the (entire, with locals too) column Map, and code that relies on this Import could work unchanged. Sparse matrix-vector multiply implementations, such as those in CrsMatrix and BlockCrsMatrix (see #424), could then do coarse-grained overlap as follows:

1. Start a nonblocking Import of the remotes
2. Import the locals (if necessary)
3. Do the local part of the mat-vec
4. Finish the nonblocking Import of the remotes
5. Do the remote part of the mat-vec (in place, in the row Map vector -- hence coarse-grained overlap)
6. Do an analogous procedure to overlap the Export, if an Export is needed (if row Map != range Map)
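
A minimal sketch of the steps above, not Tpetra code. It assumes a hypothetical `LocalCsr` struct whose column indices are ordered locals first, then remotes, and it stands in for the nonblocking Import with a `std::future` that the caller starts before calling `overlappedApply` (Step 1). The names `LocalCsr`, `applyColumnRange`, and `overlappedApply` are made up for illustration. The Export side (the last step) is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical local CSR storage (not a Tpetra type). Column indices are
// local indices into the column Map: [0, numLocalCols) are locals, and
// indices >= numLocalCols are remotes.
struct LocalCsr {
  std::size_t numRows;
  std::size_t numLocalCols;
  std::vector<std::size_t> rowPtr;
  std::vector<std::size_t> colInd;
  std::vector<double> val;
};

// Accumulate into y the contribution of the matrix entries whose column
// index lies in [colBegin, colEnd). x holds only that column range, so it
// is indexed by (column - colBegin).
void applyColumnRange(const LocalCsr& A, const std::vector<double>& x,
                      std::size_t colBegin, std::size_t colEnd,
                      std::vector<double>& y) {
  assert(x.size() >= colEnd - colBegin);
  assert(y.size() == A.numRows);
  for (std::size_t i = 0; i < A.numRows; ++i) {
    double sum = 0.0;
    for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
      const std::size_t j = A.colInd[k];
      if (j >= colBegin && j < colEnd) {
        sum += A.val[k] * x[j - colBegin];
      }
    }
    y[i] += sum;
  }
}

// Coarse-grained overlap: the caller has already started the "Import" of
// the remotes (the future), and xLocal is assumed usable in place.
std::vector<double> overlappedApply(const LocalCsr& A,
                                    const std::vector<double>& xLocal,
                                    std::future<std::vector<double>> remoteImport) {
  std::vector<double> y(A.numRows, 0.0);
  // Step 3: local part of the mat-vec, while the Import is in flight.
  applyColumnRange(A, xLocal, 0, A.numLocalCols, y);
  // Step 4: finish the nonblocking Import of the remotes.
  const std::vector<double> xRemote = remoteImport.get();
  // Step 5: remote part of the mat-vec, accumulated in place into y.
  applyColumnRange(A, xRemote, A.numLocalCols,
                   A.numLocalCols + xRemote.size(), y);
  return y;
}
```

A caller would start the stand-in Import with something like `std::async(std::launch::async, ...)` and pass the resulting future to `overlappedApply`. Filtering by column range inside one CSR structure is only one option; a real implementation might instead store the local and remote blocks separately.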

The "if necessary" remark on Step 2 relates to whether the domain Map is "fitted" to the column Map (see #437 (closed) for a definition of "fitted"). If so, the local entries of the input vector would not need to be copied (see #435 and #436). If not, they would need to be copied (and/or permuted), but the decision to copy could be made per process. For example, processes with the same number of local entries and no entries that need permutation would not need to make a copy: the local part of the mat-vec could just take the original input (multi)vector pointer as input. (In fact, the domain Map need not even be fitted; that's sufficient but not necessary.)
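
The per-process check could look roughly like the sketch below. It is an illustration only: `localsUsableInPlace` is a hypothetical helper, and the Maps are represented as plain vectors of global indices rather than Tpetra::Map objects. It assumes the column Map lists its local entries first, and it treats "no copy needed" as "the leading column Map entries match the domain Map entries one for one, in the same order". In Tpetra itself, the same information would presumably come from the Import's counts of same and permuted IDs (e.g. `getNumSameIDs()` and `getNumPermuteIDs()`) rather than from a fresh comparison.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a Tpetra function). Returns true if, on the
// calling process, every domain Map entry appears at the same position at
// the front of the column Map. In that case the local part of the mat-vec
// could read the input vector's data directly, with no copy or permutation.
bool localsUsableInPlace(const std::vector<long long>& domainGids,
                         const std::vector<long long>& columnGids) {
  if (columnGids.size() < domainGids.size()) {
    return false;  // some local domain entries are missing from the column Map
  }
  for (std::size_t i = 0; i < domainGids.size(); ++i) {
    if (domainGids[i] != columnGids[i]) {
      return false;  // this entry would need to be permuted or copied
    }
  }
  return true;
}
```

Because the result depends only on data the process already holds, each process can take the no-copy path or the copy path independently, with no communication.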

This approach has the following benefits over one that uses a different Import than the CrsGraph's domain -> column Map Import:

- CrsGraph's Import retains its current meaning
- Neither the graph nor the matrix would need to compute a new Import object just for the remotes
- This approach would work for any domain and column Maps, and in fact for any range and row Maps

In particular, this approach would work regardless of whether the domain Map is fitted to the column Map. The graph or matrix would not need to do any extra all-reduces to figure out if the Maps are fitted on all processes.