KokkosKernels, Tpetra: Finish & optimize threaded GEMV & GEMM

Created by: mhoemmen

Tpetra::MultiVector::multiply currently invokes the BLAS for GEMM and GEMV operations. Belos in turn invokes Tpetra::MultiVector::multiply for the projection and basis vector update operations in classical Gram-Schmidt, an important kernel in the GMRES iterative linear solver.
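
To make the GEMV connection concrete, here is a minimal sketch of one classical Gram-Schmidt step written as two dense matrix-vector products: a transposed GEMV for the projection and a non-transposed GEMV for the update. This is not Belos or Tpetra code. The `DenseMatrix` type and the `project` and `update` functions are hypothetical names used only for illustration, and the sketch assumes a single process with column-major storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: column-major dense matrix, rows x cols.
struct DenseMatrix {
  std::size_t rows;
  std::size_t cols;
  std::vector<double> data;  // entry (i, j) is stored at data[i + j * rows]

  double operator()(std::size_t i, std::size_t j) const {
    return data[i + j * rows];
  }
};

// Projection step, a transposed GEMV: h := Q^T * v.
// Each entry of h is the dot product of one basis vector with v.
std::vector<double> project(const DenseMatrix& Q, const std::vector<double>& v) {
  assert(v.size() == Q.rows);
  std::vector<double> h(Q.cols, 0.0);
  for (std::size_t j = 0; j < Q.cols; ++j) {
    for (std::size_t i = 0; i < Q.rows; ++i) {
      h[j] += Q(i, j) * v[i];
    }
  }
  return h;
}

// Update step, a non-transposed GEMV: v := v - Q * h.
// This removes the components of v along the columns of Q.
void update(const DenseMatrix& Q, const std::vector<double>& h,
            std::vector<double>& v) {
  assert(h.size() == Q.cols && v.size() == Q.rows);
  for (std::size_t j = 0; j < Q.cols; ++j) {
    for (std::size_t i = 0; i < Q.rows; ++i) {
      v[i] -= Q(i, j) * h[j];
    }
  }
}
```

Calling `project` and then `update` with the same `Q` orthogonalizes `v` against the columns of `Q`. With several vectors to orthogonalize at once, the same two steps become GEMM operations.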

Trilinos' configuration process finds the system BLAS by default. That BLAS implementation is usually not threaded, so the GEMM and GEMV calls described above, which make up a big part of a linear solve, won't get threaded.
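
As a rough illustration of what a threaded GEMV involves, the sketch below splits the rows of a column-major matrix into contiguous blocks and gives each block to its own thread. It uses `std::thread` only as a stand-in; it is not the KokkosKernels implementation, and the function name `threaded_gemv` and its parameters are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Computes y := A * x, where A is m x n in column-major storage.
// Rows are divided into contiguous blocks, one block per thread.
// Each thread writes a disjoint range of y, so no locking is needed.
std::vector<double> threaded_gemv(std::size_t m, std::size_t n,
                                  const std::vector<double>& A,
                                  const std::vector<double>& x,
                                  unsigned num_threads) {
  assert(A.size() == m * n && x.size() == n && num_threads > 0);
  std::vector<double> y(m, 0.0);

  // Ceiling division so that every row is assigned to some block.
  const std::size_t chunk = (m + num_threads - 1) / num_threads;

  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    const std::size_t begin = std::min(m, t * chunk);
    const std::size_t end = std::min(m, begin + chunk);
    if (begin == end) break;  // no rows left for this or later threads
    workers.emplace_back([&, begin, end] {
      for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = begin; i < end; ++i) {
          y[i] += A[i + j * m] * x[j];
        }
      }
    });
  }
  for (auto& w : workers) w.join();
  return y;
}
```

Partitioning by rows suits the non-transposed case because the output entries are independent. The transposed case (`y := A^T * x`) is a set of dot products, so a threaded version would instead need a parallel reduction over rows or a partition over columns.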

My work-around for #243 (closed) takes the first step toward solving this; see that issue for details. However, the kernels are not yet complete and have not been optimized. I'll write more here about how to finish and optimize them.