Tpetra: Make CrsMatrix do thread-parallel pack & unpack
Created by: mhoemmen
@trilinos/tpetra Story: #797
See notes on #802 (closed). Try to share as much code with that solution as possible. For example, it would make sense to have a single pack function, and let callers decide whether they want to pack PIDs.
Steps:
- Remove Teuchos::ArrayView from pack & unpack (see #229); use raw pointers and/or Kokkos::View all the way through
- Host-only thread parallelization, const graph only
- Host+GPU thread parallelization, const graph only
It's much harder to do this for a dynamic graph, so we can skip that case for now.
Thread parallelization of unpack should be over rows, so we should not need atomic updates when updating values in the matrix.
For the host-only thread parallelization of unpack, a single parallel_scan over local (row) indices suffices: the scan turns the per-row byte counts into offsets into the unpack buffer, and the 'final' pass of the scan actually unpacks the data.
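A rough sketch of the scan-then-unpack idea, written as a serial loop in plain C++ so it stands alone. This is not Tpetra code: `LocalCrs`, `unpack_row`, `unpack_and_combine`, and the packed layout (per row, all column indices followed by all values) are hypothetical, and it assumes local column indices, sum-into combine, and that each target row appears at most once in the import list. In a Kokkos version, the loop body would presumably become a scan functor's `operator()(i, update, final)`, with the unpack done only when `final` is true.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a local CRS matrix whose graph is fixed (const).
struct LocalCrs {
  std::vector<std::size_t> row_ptr;  // size num_rows + 1
  std::vector<int> col_ind;          // local column indices
  std::vector<double> values;
};

// Assumed packed layout for one row with n entries: n ints, then n doubles.
constexpr std::size_t bytes_per_entry = sizeof(int) + sizeof(double);

// Unpack one row and sum its values into the matching entries of row `lid`.
// Only row `lid` is written, so distinct rows never touch the same memory.
void unpack_row(LocalCrs& A, int lid, const char* buf, std::size_t nbytes) {
  assert(nbytes % bytes_per_entry == 0);
  const std::size_t n = nbytes / bytes_per_entry;
  const char* cols = buf;
  const char* vals = buf + n * sizeof(int);
  for (std::size_t k = 0; k < n; ++k) {
    int c;
    double v;
    std::memcpy(&c, cols + k * sizeof(int), sizeof(int));
    std::memcpy(&v, vals + k * sizeof(double), sizeof(double));
    // Const graph: the column must already exist in the row; just find it.
    for (std::size_t j = A.row_ptr[lid]; j < A.row_ptr[lid + 1]; ++j) {
      if (A.col_ind[j] == c) {
        A.values[j] += v;
        break;
      }
    }
  }
}

// Serial stand-in for the single parallel_scan: `offset` plays the role of
// the scan's running exclusive prefix sum over the per-row byte counts.
void unpack_and_combine(LocalCrs& A,
                        const std::vector<int>& import_lids,
                        const std::vector<char>& imports,
                        const std::vector<std::size_t>& num_bytes) {
  assert(import_lids.size() == num_bytes.size());
  std::size_t offset = 0;
  for (std::size_t i = 0; i < import_lids.size(); ++i) {
    // In a parallel scan, this call would run only in the 'final' pass.
    unpack_row(A, import_lids[i], imports.data() + offset, num_bytes[i]);
    offset += num_bytes[i];
  }
  assert(offset == imports.size());
}
```

Because each iteration writes only to its own row, this is where the "no atomic updates" claim comes from; it stops holding if the same local row can show up twice in the import list.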
For pack, we first have to change packRow so that it goes directly to the KokkosSparse::CrsMatrix if that exists, rather than going through the "generic" getLocalRowView / getGlobalRowView interfaces that return Teuchos::ArrayView instances.
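A rough sketch of what a single pack routine might look like once it reads the CRS arrays directly and lets the caller decide about PIDs. Everything here is hypothetical and simplified: `CrsArrays` stands in for the arrays one would get from the KokkosSparse::CrsMatrix, `packed_row_bytes` and `pack_row` are made-up names, the buffer layout (column indices, then optional PIDs, then values) is an assumption, and it packs local rather than global column indices to stay short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the raw arrays of a local CRS matrix.
struct CrsArrays {
  std::vector<std::size_t> row_ptr;  // size num_rows + 1
  std::vector<int> col_ind;          // local column indices
  std::vector<double> values;
};

// Bytes needed to pack row `lid`, with or without one PID per entry.
std::size_t packed_row_bytes(const CrsArrays& A, int lid, bool pack_pids) {
  const std::size_t n = A.row_ptr[lid + 1] - A.row_ptr[lid];
  return n * (sizeof(int) + sizeof(double) + (pack_pids ? sizeof(int) : 0));
}

// Pack row `lid` straight from the CRS arrays into `out`.
// `col_pids` maps a local column index to its owning process rank;
// pass nullptr to skip packing PIDs. Returns the number of bytes written.
std::size_t pack_row(const CrsArrays& A, int lid, const int* col_pids,
                     char* out) {
  const std::size_t begin = A.row_ptr[lid];
  const std::size_t n = A.row_ptr[lid + 1] - begin;
  char* p = out;

  // Column indices: one contiguous copy, no intermediate row view.
  std::memcpy(p, A.col_ind.data() + begin, n * sizeof(int));
  p += n * sizeof(int);

  // Optional PIDs, looked up per column index.
  if (col_pids != nullptr) {
    for (std::size_t k = 0; k < n; ++k) {
      const int pid = col_pids[A.col_ind[begin + k]];
      std::memcpy(p, &pid, sizeof(int));
      p += sizeof(int);
    }
  }

  // Values: again one contiguous copy.
  std::memcpy(p, A.values.data() + begin, n * sizeof(double));
  p += n * sizeof(double);

  return static_cast<std::size_t>(p - out);
}
```

Callers that need PIDs pass a PID array; callers that do not pass `nullptr`, so both paths share the same function. Since `packed_row_bytes` gives the per-row byte counts up front, the same scan over rows used for unpack could produce the write offsets for a thread-parallel pack.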