Tpetra: Thread-parallelize RowMatrixTransposer (for P^T (A*P) in MueLu setup)
Created by: mhoemmen
@trilinos/tpetra @trilinos/muelu Story: #629 Blocked by: #905 (closed)
Thread-parallelize Tpetra::RowMatrixTransposer. This matters because of the P^T * (AP) kernel in MueLu setup. The transpose should normally still cost less than the sparse matrix-matrix multiply, but once we thread the latter, the transpose may become the bottleneck. It's also not hard to thread-parallelize.
In sparse matrix-matrix multiply, the input Tpetra::CrsMatrix to transpose is always fill complete. Thus, we can work on its local KokkosSparse::CrsMatrix directly. Here's how to get the number of entries in each row of the transpose (the result), which is the number of entries in each column of the input:
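using Kokkos::Atomic;
using Kokkos::MemoryTraits;
using Kokkos::parallel_for;
using Kokkos::RangePolicy;

// Presumably: DT is the device type, LO the local ordinal type, and
// ptr / ind are the local matrix's row offsets and column indices.
// Kokkos zero-initializes counts on allocation.
Kokkos::View<offset_type*, DT> counts ("counts", lclNumCols);
// Same memory, but every access through this View is atomic.
Kokkos::View<offset_type*, DT, MemoryTraits<Atomic> > countsAtomic = counts;

parallel_for (RangePolicy<typename DT::execution_space, LO> (0, lclNumRows),
  KOKKOS_LAMBDA (const LO& lclRow) {
    // Each entry in column ind(k) of the input becomes an entry in
    // row ind(k) of the transpose.
    for (offset_type k = ptr(lclRow); k < ptr(lclRow+1); ++k) {
      ++countsAtomic(ind(k));
    }
  });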
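
For anyone who wants to try the counting idea without a Kokkos build, here is a minimal sketch in standard C++ only. It is not the Tpetra implementation. `std::atomic` and `std::thread` stand in for the atomic View and `parallel_for`, and the helper names `countTransposeRowEntries` and `countsToOffsets` are hypothetical. The second helper assumes the natural next step is an exclusive prefix sum that turns the counts into the row offsets of the transpose.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not part of Tpetra or Kokkos): count the entries in
// each column of a CSR matrix, i.e. in each row of its transpose.
// ptr holds numRows+1 row offsets; ind holds the column indices.
std::vector<std::size_t>
countTransposeRowEntries (const std::vector<std::size_t>& ptr,
                          const std::vector<int>& ind,
                          const std::size_t numCols,
                          const unsigned numThreads = 2)
{
  assert (! ptr.empty ());
  assert (numThreads > 0);
  const std::size_t numRows = ptr.size () - 1;

  // One atomic counter per column, explicitly zeroed.
  std::vector<std::atomic<std::size_t>> counts (numCols);
  for (auto& c : counts) {
    c.store (0);
  }

  // Each thread takes a strided subset of the rows. Different rows may hit
  // the same column, so the increments must be atomic.
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t) {
    workers.emplace_back ([&, t] () {
      for (std::size_t row = t; row < numRows; row += numThreads) {
        for (std::size_t k = ptr[row]; k < ptr[row + 1]; ++k) {
          counts[static_cast<std::size_t> (ind[k])].fetch_add (
            1, std::memory_order_relaxed);
        }
      }
    });
  }
  for (auto& w : workers) {
    w.join ();
  }

  // Copy out into a plain vector for the caller.
  std::vector<std::size_t> result (numCols);
  for (std::size_t j = 0; j < numCols; ++j) {
    result[j] = counts[j].load ();
  }
  return result;
}

// Hypothetical helper: exclusive prefix sum of the counts, giving the
// row offsets of the transpose (numCols+1 entries).
std::vector<std::size_t>
countsToOffsets (const std::vector<std::size_t>& counts)
{
  std::vector<std::size_t> offsets (counts.size () + 1, 0);
  for (std::size_t j = 0; j < counts.size (); ++j) {
    offsets[j + 1] = offsets[j] + counts[j];
  }
  return offsets;
}
```

For a 3 x 4 input with `ptr = {0, 2, 3, 6}` and `ind = {0, 2, 2, 0, 1, 2}`, the per-column counts are `{2, 1, 3, 0}` and the transpose's row offsets are `{0, 2, 3, 6, 6}`. Relaxed memory ordering should suffice here because the counters are only read after all threads have been joined.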