KokkosKernels: Optimize BCRS mat-vec for GPUs and CPU SIMD
Created by: mhoemmen
@trilinos/tpetra
Our fix for #178 (closed) includes only a one-level thread parallelization of BlockCrsMatrix (BCRS) matrix-vector multiply. It runs on all supported platforms (including GPUs), but it needs further optimization, both for GPUs and for CPU vectorization. At a minimum, we should plug in vendor libraries wherever they are well optimized (especially for block sizes that are not powers of two).
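To fix ideas, here is a minimal serial reference sketch of the kernel in question. It assumes the common layout in which each b-by-b block is stored contiguously in row-major order; the function and parameter names are illustrative, not the KokkosKernels interface. The comments mark the two loop levels that a two-level (team plus vector/SIMD) parallelization would map to hardware.

```c++
// Minimal serial reference for BCRS mat-vec (y = A*x).  Illustrative only;
// assumes each b-by-b block is stored contiguously in row-major order.
#include <cstdint>
#include <vector>

using LO = int;                  // local ordinal (column index) type
using Scalar = double;
using offset_type = std::int64_t;

// num_block_rows: number of block rows; b: block size.
// ptr[i]..ptr[i+1] indexes the blocks of block row i.
// ind[k] is the block column of the k-th block.
// val[k*b*b .. (k+1)*b*b) holds the k-th block, row major.
// y must have length num_block_rows * b.
void bcrs_matvec(LO num_block_rows, LO b,
                 const std::vector<offset_type>& ptr,
                 const std::vector<LO>& ind,
                 const std::vector<Scalar>& val,
                 const std::vector<Scalar>& x,
                 std::vector<Scalar>& y)
{
  for (LO i = 0; i < num_block_rows; ++i) {      // candidate: team/thread level
    for (LO r = 0; r < b; ++r) y[i*b + r] = 0.0;
    for (offset_type k = ptr[i]; k < ptr[i+1]; ++k) {
      const LO j = ind[k];                       // block column
      const Scalar* blk = &val[k * b * b];       // dense, contiguous block
      for (LO r = 0; r < b; ++r) {               // candidate: vector/SIMD level
        Scalar sum = 0.0;
        for (LO c = 0; c < b; ++c)
          sum += blk[r*b + c] * x[j*b + c];
        y[i*b + r] += sum;
      }
    }
  }
}
```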
What does "well optimized" mean? The lower bar should be "matches performance of mat-vec with the same sparse matrix, stored in (non-block) CRS format." That is, we compare against Christian's hand-optimized (non-block) CRS kernel, by taking the block matrix and storing it in non-block format. When performance matches, we have achieved the lower bar.
This is a lower bar because it inflates total per-MPI-process memory usage of the matrix by the ratio M_crs / M_bcrs, where

M_crs = Z * (sizeof(LO) + sizeof(Scalar)) + (m + 1) * sizeof(offset_type)

and

M_bcrs = Z * (sizeof(LO) / b^2 + sizeof(Scalar)) + (m/b + 1) * sizeof(offset_type).
In these formulae, Z is the number of matrix entries stored on that process (counting each entry within a block separately), m is the number of rows on that process, b is the block size (each block is b x b), and LO ("local ordinal") is the type of each column index. Matrix-vector multiply must read all of the matrix's data exactly once, so this comparison is a bandwidth-based lower bound: both kernels should be bandwidth bound whenever the problem does not fit in cache.
This is a reasonable lower bar because, for typical Scalar = double and LO = int, M_crs / M_bcrs < 2; in fact, the ratio approaches 1 + sizeof(LO)/sizeof(Scalar) = 1.5 as b grows. Matching CRS performance thus also suggests a stopping criterion for optimization. However, it may be possible for BCRS mat-vec to reach a higher fraction of peak memory bandwidth than CRS mat-vec, because its loads are more contiguous and direct (each b x b block is read as a dense, contiguous chunk). Thus, I am open to suggestions for a better stopping criterion.
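To make the ratio concrete, here is a small self-contained check, assuming Scalar = double and LO = int as above and ignoring the O(m) offset terms (which are negligible when rows have many entries). This is plain arithmetic on the formulas, not measured data.

```c++
// Back-of-the-envelope check of M_crs / M_bcrs for Scalar = double, LO = int,
// ignoring the O(m) row-offset terms.
#include <cstdio>

int main() {
  const double sLO = sizeof(int), sS = sizeof(double);
  for (int b = 1; b <= 8; ++b) {
    // Per scalar entry: CRS stores one index and one value; BCRS amortizes
    // one block-column index over b*b values.
    const double ratio = (sLO + sS) / (sLO / (b * b) + sS);
    std::printf("b = %d: M_crs / M_bcrs ~ %.3f\n", b, ratio);
  }
  return 0;  // ratio climbs from 1.0 (b = 1) toward 1.5, staying below 2
}
```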