Tpetra::CrsGraph::getLocalDiagOffsets: Kokkos-ize
Created by: mhoemmen
This blocks #41 (closed), but is only a serious blocker if the matrix's graph structure changes frequently. This depends on #205 (closed).
It's OK for Tpetra::CrsGraph::getLocalDiagOffsets to have a full Kokkos parallelization (e.g., on a CUDA device) only if the graph is fillComplete. If not, either the existing sequential code or a host execution space parallelization suffices. Thus, the code in CrsGraph should branch between these two cases.
Ifpack2 should only exercise the fillComplete case; otherwise, it's a performance bug for Ifpack2 and possibly a correctness bug for the application. (Giving a non-fillComplete matrix to Ifpack2 is wrong, because one should always fillComplete the matrix before handing it off to a solver. If the matrix has never been fillComplete before, calling fillComplete could change the offsets of the diagonal entries, which is what getLocalDiagOffsets computes.)
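For reference, here is a minimal sketch of the per-row computation being discussed, written in plain C++ over raw CSR arrays rather than against Tpetra or Kokkos. The names localDiagOffsets and kNoDiag are hypothetical, and the sketch assumes that local row and column indices coincide (so the diagonal entry of row i has column index i), that offsets are measured from the start of each row, and that a sentinel marks rows with no diagonal entry. Each row is independent of the others, so the outer loop is the part a Kokkos parallel_for over rows would take over in the fillComplete case.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sentinel meaning "this row stores no diagonal entry".
constexpr std::size_t kNoDiag = std::numeric_limits<std::size_t>::max();

// Hypothetical helper, not Tpetra's API.
// rowPtr has numRows + 1 entries; colInd holds the column indices of
// row r in the half-open range [rowPtr[r], rowPtr[r + 1]).
// Assumption: local row and column indices coincide, so the diagonal
// entry of row r is the one whose column index equals r.
std::vector<std::size_t> localDiagOffsets(const std::vector<std::size_t>& rowPtr,
                                          const std::vector<int>& colInd) {
  const std::size_t numRows = rowPtr.empty() ? 0 : rowPtr.size() - 1;
  std::vector<std::size_t> offsets(numRows, kNoDiag);

  // Rows are independent: this loop is the candidate for parallelization.
  for (std::size_t row = 0; row < numRows; ++row) {
    for (std::size_t k = rowPtr[row]; k < rowPtr[row + 1]; ++k) {
      if (colInd[k] == static_cast<int>(row)) {
        // Store the offset relative to the start of the row.
        offsets[row] = k - rowPtr[row];
        break;
      }
    }
  }
  return offsets;
}
```

The result is only meaningful for as long as the stored structure stays fixed, which is why the offsets computed before a first fillComplete may no longer be valid afterwards.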
Take care to make the whole Kokkos::StaticCrsGraph memory-unmanaged before entering the Kokkos kernel. Otherwise, the kernel will be slow with non-CUDA execution spaces, due to memory management overhead.