KokkosKernels: Push LocalOrdinal into single-vector kernels to save 2x instantiations

Created by: mhoemmen

@trilinos/tpetra Tpetra::Vector and Tpetra::MultiVector have a LocalOrdinal template parameter. The number of rows in a Vector or MultiVector is always supposed to fit in LocalOrdinal. Currently, Tpetra only instantiates and tests LocalOrdinal = int. However, when Tpetra uses KokkosKernels to do vector operations (norms, dot products, and vector updates), the kernels compare the vectors' lengths to INT_MAX at run time and dispatch to either SizeType = int or SizeType = Vec::size_type (always size_t, except for CUDA, where it's int). This is supposed to improve performance (Intel processors especially like 32-bit loop indices). However, since the length is already supposed to fit in LocalOrdinal, the run-time check is redundant for Tpetra, and it doubles the number of instantiations unnecessarily. It does so even if LocalOrdinal is 64 bits (e.g., LocalOrdinal = long long or long, both signed, while size_t is unsigned).
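To make the instantiation count concrete, here is a minimal sketch of the two approaches. It is not the actual KokkosKernels code: `dot_impl`, `dot_runtime_dispatch`, and `dot_fixed_index` are hypothetical names, and `std::vector` stands in for a Kokkos::View.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a single-vector kernel, templated on the
// loop index type (the "SizeType" discussed above).
template <class SizeType>
double dot_impl(const std::vector<double>& x, const std::vector<double>& y) {
  const SizeType n = static_cast<SizeType>(x.size());
  double sum = 0.0;
  for (SizeType i = 0; i < n; ++i) {
    sum += x[i] * y[i];
  }
  return sum;
}

// Run-time dispatch: the compiler must instantiate BOTH dot_impl<int>
// and dot_impl<std::size_t>, even if only one is ever executed.
double dot_runtime_dispatch(const std::vector<double>& x,
                            const std::vector<double>& y) {
  assert(x.size() == y.size());
  if (x.size() < static_cast<std::size_t>(INT_MAX)) {
    return dot_impl<int>(x, y);
  }
  return dot_impl<std::size_t>(x, y);
}

// Proposed alternative: the caller's LocalOrdinal fixes the index type,
// so each LocalOrdinal costs exactly one kernel instantiation.
template <class LocalOrdinal>
double dot_fixed_index(const std::vector<double>& x,
                       const std::vector<double>& y) {
  assert(x.size() == y.size());
  return dot_impl<LocalOrdinal>(x, y);
}
```

With LocalOrdinal = int, `dot_fixed_index<int>` compiles one loop body where `dot_runtime_dispatch` compiles two.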

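I wrote "single-vector kernels" in the title, because kernels that address entries of a contiguously stored MultiVector using (i,j) indexing may need a larger SizeType than LocalOrdinal. For example, consider a 200-column MultiVector (which we may reasonably encounter in iterative solvers like GMRES) with 10^7 rows. That is 2 x 10^9 entries, just under INT_MAX (about 2.147 x 10^9) for a 32-bit int. Slightly more columns or rows (e.g., 215 columns), and SizeType = int will overflow, even though the number of rows still fits comfortably in LocalOrdinal = int.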
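The check for flattened (i,j) indexing can be sketched as follows. The helper names are hypothetical, and the sketch assumes a contiguous block with no stride padding: it computes the largest offset in 64-bit arithmetic and compares it against INT_MAX.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Largest flattened offset when indexing a contiguous numRows x numCols
// block with (i,j), computed in 64-bit arithmetic to avoid overflowing
// while we check for overflow. Assumes no padding between columns.
inline std::int64_t max_flat_offset(std::int64_t numRows,
                                    std::int64_t numCols) {
  assert(numRows > 0 && numCols > 0);
  return numRows * numCols - 1;
}

// True if every flattened offset is representable as an int.
inline bool flat_index_fits_in_int(std::int64_t numRows,
                                   std::int64_t numCols) {
  return max_flat_offset(numRows, numCols) <=
         static_cast<std::int64_t>(INT_MAX);
}
```

On a platform with 32-bit int, a single vector with 10^7 rows passes this check, while a 215-column MultiVector of the same height does not, although the row count alone fits in int in both cases.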

If people want to use KokkosKernels stand-alone, it may make sense for it to dispatch to different instantiations at run time, as a performance optimization. However, I think Tpetra at least needs the ability to turn off this feature at configure or compile time.
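One possible shape for such a switch, as a sketch only: the macro `EXAMPLE_DISABLE_RUNTIME_INDEX_DISPATCH` and the function names are hypothetical, not existing KokkosKernels or Tpetra options. A configure step (e.g., a CMake option) would define the macro.

```cpp
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical macro; a configure-time option would define it.
// #define EXAMPLE_DISABLE_RUNTIME_INDEX_DISPATCH

// Hypothetical stand-in kernel, templated on the loop index type.
template <class SizeType>
double sum_of_squares_impl(const std::vector<double>& x) {
  const SizeType n = static_cast<SizeType>(x.size());
  double sum = 0.0;
  for (SizeType i = 0; i < n; ++i) {
    sum += x[i] * x[i];
  }
  return sum;
}

template <class LocalOrdinal>
double sum_of_squares(const std::vector<double>& x) {
#ifdef EXAMPLE_DISABLE_RUNTIME_INDEX_DISPATCH
  // Dispatch disabled: only the LocalOrdinal instantiation is compiled.
  return sum_of_squares_impl<LocalOrdinal>(x);
#else
  // Default (stand-alone use): pick the index type at run time, which
  // instantiates both the int and the std::size_t versions.
  if (x.size() < static_cast<std::size_t>(INT_MAX)) {
    return sum_of_squares_impl<int>(x);
  }
  return sum_of_squares_impl<std::size_t>(x);
#endif
}
```

Stand-alone users would keep the default behavior, while a Tpetra build could define the macro and get one instantiation per LocalOrdinal.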