Ifpack2: Improve block Jacobi performance
Created by: mhoemmen
@trilinos/ifpack2 @trilinos/tpetra @amklinv
We want to improve performance of block Jacobi by at least 2x. "Block Jacobi" is what happens when one gives a `Tpetra::Experimental::BlockCrsMatrix` to `Ifpack2::Relaxation` and asks for Jacobi. The current implementation of block Jacobi lives in `Ifpack2_Relaxation_def.hpp`, in the `ApplyInverseJacobi_BlockCrsMatrix` method.
There are two optimizations that come to mind:
- If the initial guess vector is zero (`ZeroStartingSolution_` is `true`), then on the first (zeroth) sweep, we can replace the sparse mat-vec (`blockMat->apply(Y, A_times_Y)`, line 1166) with a "block scaling" by the inverse of the block diagonal. Compare to non-block Jacobi in lines 1076-1100 of the same file (the `ApplyInverseJacobi` method).
- Currently, we apply the inverse of the block diagonal using linear solves (GETRS). It is likely more efficient to precompute the explicit inverse (GETRI) and apply it directly (GEMV), for three reasons. First, batched GEMV is a simpler kernel and thus easier to optimize; it may even have vendor implementations. Second, not passing the pivots around means less data movement. Third, not passing the pivots around makes interfaces simpler, and simpler interfaces are easier to optimize.
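To make the zeroth-sweep shortcut concrete, here is a minimal, dependency-free C++ sketch (the function name, the row-major block layout, and the flat-array interface are illustrative assumptions, not the Ifpack2 implementation). With a zero initial guess, the sweep `Y = omega * D^{-1} * (B - A*Y)` collapses to `Y = omega * D^{-1} * B`, so each block of `Y` is just a small dense GEMV with the precomputed inverse diagonal block; no sparse mat-vec is needed:

```cpp
#include <vector>

// Zeroth Jacobi sweep with a zero initial guess (sketch, not Ifpack2 code).
// Dinv holds numBlocks precomputed inverse diagonal blocks, each
// blockSize x blockSize in row-major order. B and Y are block vectors of
// length numBlocks * blockSize.
void firstSweepBlockScaling(int numBlocks, int blockSize, double omega,
                            const std::vector<double>& Dinv,
                            const std::vector<double>& B,
                            std::vector<double>& Y) {
  for (int blk = 0; blk < numBlocks; ++blk) {
    const double* Dblk = &Dinv[blk * blockSize * blockSize];
    const double* Bblk = &B[blk * blockSize];
    double* Yblk = &Y[blk * blockSize];
    // Small dense GEMV per block: Y_blk = omega * Dinv_blk * B_blk.
    for (int i = 0; i < blockSize; ++i) {
      double sum = 0.0;
      for (int j = 0; j < blockSize; ++j) {
        sum += Dblk[i * blockSize + j] * Bblk[j];
      }
      Yblk[i] = omega * sum;
    }
  }
}
```

This also illustrates the second bullet: once the inverse blocks are stored explicitly, applying them is a batch of independent small GEMVs with no pivot arrays to carry around.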
All other optimizations should wait until we know how many sweeps the users want. Block Jacobi relies on `BlockCrsMatrix::apply` (sparse mat-vec) after the first sweep. The logical way to improve performance for multiple sweeps would be to make sparse mat-vec faster. With thread parallelization, we will also need to consider the update and "block scaling" kernels (compare to lines 1111 and 1112 of non-block Jacobi). It may also pay to merge the sparse mat-vec, update, and block scaling kernels into a single kernel, like we do with block Gauss-Seidel.
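A merged sweep kernel along those lines could be sketched as follows (a hand-rolled block-CSR layout and hypothetical names, for illustration only). One pass over the block rows fuses the sparse block mat-vec, the residual update, and the block scaling, so `Y_new = Y_old + omega * D^{-1} * (B - A*Y_old)` is computed without materializing `A*Y_old` as a separate vector:

```cpp
#include <vector>

// One fused block Jacobi sweep (sketch, not Ifpack2 code).
// Block-CSR input: rowPtr/colInd index block entries; vals stores one
// bs x bs row-major block per entry. Dinv stores one precomputed inverse
// diagonal block per block row. B, Yold, Ynew are block vectors of length
// numBlockRows * bs.
void fusedBlockJacobiSweep(int numBlockRows, int bs, double omega,
                           const std::vector<int>& rowPtr,
                           const std::vector<int>& colInd,
                           const std::vector<double>& vals,
                           const std::vector<double>& Dinv,
                           const std::vector<double>& B,
                           const std::vector<double>& Yold,
                           std::vector<double>& Ynew) {
  std::vector<double> r(bs);
  for (int row = 0; row < numBlockRows; ++row) {
    // r = B_row - (A * Yold)_row, computed inline (the sparse mat-vec part).
    for (int i = 0; i < bs; ++i) r[i] = B[row * bs + i];
    for (int e = rowPtr[row]; e < rowPtr[row + 1]; ++e) {
      const double* blk = &vals[e * bs * bs];
      const double* y = &Yold[colInd[e] * bs];
      for (int i = 0; i < bs; ++i)
        for (int j = 0; j < bs; ++j)
          r[i] -= blk[i * bs + j] * y[j];
    }
    // Ynew_row = Yold_row + omega * Dinv_row * r (update + block scaling).
    for (int i = 0; i < bs; ++i) {
      double sum = 0.0;
      for (int j = 0; j < bs; ++j)
        sum += Dinv[row * bs * bs + i * bs + j] * r[j];
      Ynew[row * bs + i] = Yold[row * bs + i] + omega * sum;
    }
  }
}
```

Each block row's work is independent of the others, so the outer loop is the natural place to thread-parallelize; the fused form touches `Yold`, `B`, and the matrix once per sweep instead of once per kernel.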