Created by: kyungjoo-kim
To improve the performance of the SPARC line solver, I made several changes targeting the problem sizes of interest (approximately 4 GB of device memory).
- In general, it now uses a larger team size. Because the target problem size is relatively small, most line kernels cannot fill the whole GPU. Increasing the team size exposes more concurrency.
- ExtractAndFactorize: the extract routine for the Tpetra BlockCrs matrix now uses a different loop order to reduce memory transactions.
- ComputeResidual: because we use the Tpetra BlockCrs format, we cannot expect a high degree of interleaved memory access. So I moved the parallel loop one level up and coarsened the parallelism, using atomic adds.
- BlockJacobi: block Jacobi is used when a line has unit length. Previously, the numeric phase factorized and the solve phase applied forward/backward solves. Now, when a line has unit length, the numeric phase inverts the diagonals and the solve phase just applies a gemv.
- KokkosBatched files: I also included the batched header files required for this update.
- @kliegeois This also includes the fix for the compact MKL typo.
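
For readers unfamiliar with the ComputeResidual change, here is a minimal sketch of the idea in plain C++ (not the Kokkos code in this PR). It assumes one possible reading of "coarsened parallelism": each work item handles a whole nonzero block and accumulates its contribution to `r = b - A x` with an atomic add. The `BlockCrs` struct, `atomic_add`, and `compute_residual` are hypothetical names used for illustration only.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical block CRS storage: bs x bs dense blocks, row-major.
struct BlockCrs {
  int bs;
  std::vector<int> rowptr;     // size: number of block rows + 1
  std::vector<int> colidx;     // block column of each nonzero block
  std::vector<double> values;  // nnz_blocks * bs * bs entries
};

// Atomic add on a double via compare-exchange (works before C++20).
inline void atomic_add(std::atomic<double>& target, double v) {
  double old = target.load(std::memory_order_relaxed);
  while (!target.compare_exchange_weak(old, old + v,
                                       std::memory_order_relaxed)) {
  }
}

// r = b - A x, with one coarse work item per nonzero block.
std::vector<double> compute_residual(const BlockCrs& A,
                                     const std::vector<double>& x,
                                     const std::vector<double>& b,
                                     int num_threads = 2) {
  assert(num_threads > 0);
  const int bs = A.bs;
  const int nrows = static_cast<int>(A.rowptr.size()) - 1;
  const int nnz = static_cast<int>(A.colidx.size());

  // Map each nonzero block back to its block row.
  std::vector<int> blockrow(nnz);
  for (int i = 0; i < nrows; ++i)
    for (int k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k) blockrow[k] = i;

  // Start from r = b; blocks then subtract their contributions.
  std::vector<std::atomic<double>> r(b.size());
  for (std::size_t i = 0; i < b.size(); ++i) r[i].store(b[i]);

  auto worker = [&](int tid) {
    for (int k = tid; k < nnz; k += num_threads) {
      const double* blk = &A.values[static_cast<std::size_t>(k) * bs * bs];
      const double* xk = &x[static_cast<std::size_t>(A.colidx[k]) * bs];
      for (int ii = 0; ii < bs; ++ii) {
        double sum = 0.0;
        for (int jj = 0; jj < bs; ++jj) sum += blk[ii * bs + jj] * xk[jj];
        // Several blocks in the same row may update r concurrently.
        atomic_add(r[static_cast<std::size_t>(blockrow[k]) * bs + ii], -sum);
      }
    }
  };

  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) threads.emplace_back(worker, t);
  for (auto& t : threads) t.join();

  std::vector<double> out(b.size());
  for (std::size_t i = 0; i < b.size(); ++i) out[i] = r[i].load();
  return out;
}
```

The trade-off in this sketch is that each work item does more arithmetic per memory access, at the cost of atomic updates when several blocks share a row.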
#4388 #4584 (closed)
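
The BlockJacobi change for unit-length lines can be illustrated with a small plain C++ sketch, again not the KokkosBatched code in this PR. It assumes a single dense diagonal block stored row-major; `invert_block` and `gemv` are hypothetical helpers. The numeric phase inverts the block once, and each solve is then a matrix-vector product instead of forward/backward substitution.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Block = std::vector<double>;  // n x n, row-major

// Numeric phase: invert one diagonal block by Gauss-Jordan elimination
// with partial pivoting. Takes the block by value so the input is kept.
Block invert_block(Block a, int n) {
  Block inv(static_cast<std::size_t>(n) * n, 0.0);
  for (int i = 0; i < n; ++i) inv[i * n + i] = 1.0;

  for (int col = 0; col < n; ++col) {
    // Pick the row with the largest entry in this column as the pivot.
    int piv = col;
    for (int r = col + 1; r < n; ++r)
      if (std::fabs(a[r * n + col]) > std::fabs(a[piv * n + col])) piv = r;
    assert(a[piv * n + col] != 0.0 && "block is singular");

    if (piv != col) {
      for (int j = 0; j < n; ++j) {
        std::swap(a[piv * n + j], a[col * n + j]);
        std::swap(inv[piv * n + j], inv[col * n + j]);
      }
    }

    // Scale the pivot row so the pivot becomes 1.
    const double d = a[col * n + col];
    for (int j = 0; j < n; ++j) {
      a[col * n + j] /= d;
      inv[col * n + j] /= d;
    }

    // Eliminate this column from every other row.
    for (int r = 0; r < n; ++r) {
      if (r == col) continue;
      const double f = a[r * n + col];
      for (int j = 0; j < n; ++j) {
        a[r * n + j] -= f * a[col * n + j];
        inv[r * n + j] -= f * inv[col * n + j];
      }
    }
  }
  return inv;
}

// Solve phase: y = M x, where M is the precomputed inverse block.
std::vector<double> gemv(const Block& m, const std::vector<double>& x, int n) {
  std::vector<double> y(n, 0.0);
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) y[i] += m[i * n + j] * x[j];
  return y;
}
```

Since the inverse is computed once and reused across solves, the solve phase reduces to a dense product with no dependent substitution steps.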
How Has This Been Tested?
Tested on Kokkos-dev-2 and bowman.