Tpetra BCRS: Thread-parallelize sparse matrix-vector multiply
Created by: mhoemmen
@trilinos/tpetra @trilinos/ifpack2 @crtrott @kyungjoo-kim @amklinv
Thread-parallelize the sparse matrix-vector multiply in the apply() method of Tpetra::Experimental::BlockCrsMatrix. Please coordinate with Ryan Eberhardt, who has an excellent CUDA implementation for column-major blocks.
It would be wise to do this in two passes. First, add a simple host execution space parallelization using a lambda. Then, implement an optimized kernel, using Ryan's CUDA implementation as a starting point.
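As a rough illustration of the first pass, here is a minimal sketch of a block CRS mat-vec with one unit of work per block row, written as a lambda. Everything in it is an assumption made for illustration: the `BlockCrs` struct, `forEachBlockRow`, and `blockCrsApply` are hypothetical names, not Tpetra's data structures or API. `forEachBlockRow` runs serially here and only stands in for a real parallel dispatch such as `Kokkos::parallel_for`. Blocks are assumed square and column-major, matching the layout mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block CRS storage (NOT Tpetra's actual layout).
struct BlockCrs {
  std::size_t numBlockRows;
  std::size_t blockSize;            // blocks are blockSize x blockSize
  std::vector<std::size_t> rowPtr;  // size numBlockRows + 1
  std::vector<std::size_t> colInd;  // block column index of each stored block
  std::vector<double> values;       // blockSize*blockSize entries per block, column-major
};

// Hypothetical serial stand-in for a parallel dispatch over block rows.
// In a Kokkos-based version this would be a Kokkos::parallel_for over a RangePolicy.
template <class Functor>
void forEachBlockRow(std::size_t numBlockRows, Functor f) {
  for (std::size_t r = 0; r < numBlockRows; ++r) {
    f(r);
  }
}

// Computes y = A * x. Each block row writes only its own slice of y,
// so the iterations are independent and safe to run in parallel.
void blockCrsApply(const BlockCrs& A, const std::vector<double>& x,
                   std::vector<double>& y) {
  const std::size_t bs = A.blockSize;
  assert(A.rowPtr.size() == A.numBlockRows + 1);
  assert(y.size() == A.numBlockRows * bs);

  forEachBlockRow(A.numBlockRows, [&](std::size_t r) {
    double* yRow = y.data() + r * bs;
    for (std::size_t i = 0; i < bs; ++i) {
      yRow[i] = 0.0;
    }
    for (std::size_t k = A.rowPtr[r]; k < A.rowPtr[r + 1]; ++k) {
      const double* blk = A.values.data() + k * bs * bs;
      const double* xBlk = x.data() + A.colInd[k] * bs;
      // Column-major block: entry (i, j) is stored at blk[i + j * bs].
      for (std::size_t j = 0; j < bs; ++j) {
        for (std::size_t i = 0; i < bs; ++i) {
          yRow[i] += blk[i + j * bs] * xBlk[j];
        }
      }
    }
  });
}
```

Parallelizing over block rows avoids atomics or reductions across threads, which is why it is a reasonable first pass. A real Kokkos version would capture Views by value rather than capturing `std::vector` by reference, and an optimized kernel would likely also exploit parallelism within each row or block.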
This affects Ifpack2 as well as Tpetra, because for Jacobi with more than one sweep, Ifpack2 uses the sparse matrix-vector multiply.
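For context, the textbook damped Jacobi update shows where the mat-vec enters. This is the standard formula, not a description of Ifpack2's exact code path; here $D$ is the (block) diagonal of $A$ and $\omega$ is a damping factor:

```latex
x^{(k+1)} = x^{(k)} + \omega D^{-1} \left( b - A x^{(k)} \right)
```

Assuming a zero initial guess $x^{(0)} = 0$, the first sweep reduces to $x^{(1)} = \omega D^{-1} b$ and needs no product with $A$. Every later sweep requires one product $A x^{(k)}$, so the mat-vec cost matters as soon as there is more than one sweep.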