Tpetra::CrsMatrix: Overlap communication & computation in apply()
Created by: mhoemmen
Overlap communication & computation in the apply() method of Tpetra::CrsMatrix, which implements sparse matrix-vector multiply.
If we fix #439, then we can fix this issue without needing to change CrsGraph semantics. In particular, CrsGraph could still compute its Import from the domain Map to the entire column Map (locals included), and code that relies on this Import could work unchanged. Sparse matrix-vector multiply implementations, such as those in CrsMatrix and BlockCrsMatrix (see #424), could then do coarse-grained overlap as follows:
1. Start a nonblocking Import of the remotes
2. Import the locals (if necessary)
3. Do the local part of the mat-vec
4. Finish the nonblocking Import of the remotes
5. Do the remote part of the mat-vec (in place, in the row Map vector -- hence coarse-grained overlap)
6. Do an analogous procedure to overlap the Export, if an Export is needed (if row Map != range Map)
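The ordering of the steps above can be sketched in plain C++, without Tpetra or MPI. This is only an illustration under stated assumptions: `Csr`, `applyAdd`, `beginImportRemotes`, and `overlappedApply` are hypothetical names, the matrix is assumed to be already split into a local-column block and a remote-column block, and `std::async` stands in for the nonblocking Import. The Export overlap is omitted.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Minimal CSR storage (hypothetical; not a Tpetra type).
// Column indices are local to the block's own input vector.
struct Csr {
  std::vector<std::size_t> rowPtr;
  std::vector<std::size_t> colInd;
  std::vector<double> val;
};

std::size_t numRows(const Csr& A) {
  return A.rowPtr.empty() ? 0 : A.rowPtr.size() - 1;
}

// y += A * x, accumulating in place into y.
void applyAdd(const Csr& A, const std::vector<double>& x, std::vector<double>& y) {
  for (std::size_t i = 0; i < numRows(A); ++i) {
    double sum = 0.0;
    for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
      sum += A.val[k] * x[A.colInd[k]];
    }
    y[i] += sum;
  }
}

// Stand-in for starting a nonblocking Import of the remotes (an assumption:
// real code would post nonblocking receives and sends instead).
std::future<std::vector<double>> beginImportRemotes(std::vector<double> remoteSource) {
  return std::async(std::launch::async, [remoteSource]() { return remoteSource; });
}

std::vector<double> overlappedApply(const Csr& Alocal, const Csr& Aremote,
                                    const std::vector<double>& xLocal,
                                    std::vector<double> remoteSource) {
  std::vector<double> y(numRows(Alocal), 0.0);

  // Step 1: start the nonblocking Import of the remotes.
  std::future<std::vector<double>> pending = beginImportRemotes(std::move(remoteSource));

  // Step 2 is skipped here: xLocal is assumed usable without a copy.

  // Step 3: local part of the mat-vec, while the Import is in flight.
  applyAdd(Alocal, xLocal, y);

  // Step 4: finish the nonblocking Import.
  const std::vector<double> xRemote = pending.get();

  // Step 5: remote part of the mat-vec, accumulated in place into y.
  applyAdd(Aremote, xRemote, y);
  return y;
}
```

The sketch accumulates both parts into the same output vector, which mirrors the "in place" remark in Step 5. How the matrix would actually be split into local and remote parts is not specified in this issue and is assumed here.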
The "if necessary" remark on Step 2 relates to whether the domain Map is "fitted" to the column Map (see #437 (closed) for a definition of "fitted"). If so, the local entries of the input vector would not need to be copied (see #435 and #436). If not, they would need to be copied (and/or permuted), but whether to copy could be decided per process. For example, processes with the same number of local entries and no entries that need permutation would not need to make a copy: the local part of the mat-vec could just take the original input (multi)vector pointer as input. (In fact, the domain Map need not even be fitted; that's sufficient but not necessary.)
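A minimal sketch of that per-process decision, in plain C++. Everything here is hypothetical: `LocalPlan`, `needsLocalCopy`, and `localInput` are illustrative names, loosely modeled on the "same" and "permute" bookkeeping an Import keeps for its local entries. The sketch reads "same number of local entries" as "the count of identical leading entries equals the local length of the input", which is one interpretation, and it assumes the local part of the column Map has the same length as the input.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-process summary of the local part of an Import.
struct LocalPlan {
  std::size_t numSame;                   // leading entries identical in both Maps
  std::vector<std::size_t> permuteFrom;  // source (domain Map) local indices
  std::vector<std::size_t> permuteTo;    // target (column Map) local indices
};

// Purely local test: no communication with other processes is needed.
bool needsLocalCopy(const LocalPlan& plan, std::size_t numLocalEntries) {
  return plan.numSame != numLocalEntries || !plan.permuteFrom.empty();
}

// Returns the pointer that the local part of the mat-vec should read from:
// the original input if no copy is needed, otherwise a filled scratch buffer.
const double* localInput(const LocalPlan& plan, const std::vector<double>& x,
                         std::vector<double>& scratch) {
  if (!needsLocalCopy(plan, x.size())) {
    return x.data();  // use the original input pointer directly
  }
  scratch.assign(x.size(), 0.0);
  for (std::size_t i = 0; i < plan.numSame; ++i) {
    scratch[i] = x[i];  // copy the entries that are the same in both Maps
  }
  for (std::size_t k = 0; k < plan.permuteFrom.size(); ++k) {
    scratch[plan.permuteTo[k]] = x[plan.permuteFrom[k]];  // permute the rest
  }
  return scratch.data();
}
```

Because the test uses only data the process already holds, different processes can take different branches without any collective agreement.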
This approach has the following benefits over one that uses a different Import than the CrsGraph's domain -> column Map Import:
- CrsGraph's Import retains its current meaning
- Neither the graph nor the matrix would need to compute a new Import object just for the remotes
- This approach would work for any domain and column Maps, and in fact for any range and row Maps
In particular, this approach would work regardless of whether the domain Map is fitted to the column Map. The graph or matrix would not need to do any extra all-reduces to figure out if the Maps are fitted on all processes.