KokkosKernels: Push LocalOrdinal into single-vector kernels to save 2x instantiations
Created by: mhoemmen
@trilinos/tpetra Tpetra::Vector and Tpetra::MultiVector have a LocalOrdinal template parameter. The number of rows in a Vector or MultiVector is always supposed to fit in LocalOrdinal. Currently, Tpetra only instantiates and tests LocalOrdinal = int. However, when Tpetra uses KokkosKernels kernels to do vector operations (norms, dot products, and vector updates), these kernels compare the vectors' lengths to INT_MAX at run time and dispatch to either SizeType = int or SizeType = Vec::size_type (always size_t, except for CUDA, where it's int). This is supposed to improve performance (Intel processors especially like 32-bit loop indices). However, it doubles the number of instantiations unnecessarily. It does so even if LocalOrdinal is 64 bits (e.g., LocalOrdinal = long long or long, both signed, while size_t is unsigned).
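A minimal sketch of the dispatch pattern described above, to show where the second instantiation comes from. This is not the actual KokkosKernels code: dotImpl and dot are hypothetical names, and std::vector stands in for a Kokkos::View.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical kernel, templated on the loop index type.
// Every distinct SizeType is a separate instantiation of this function.
template <class SizeType>
double dotImpl(const std::vector<double>& x, const std::vector<double>& y) {
  const double* xp = x.data();
  const double* yp = y.data();
  const SizeType n = static_cast<SizeType>(x.size());
  double sum = 0.0;
  for (SizeType i = 0; i < n; ++i) {
    sum += xp[i] * yp[i];
  }
  return sum;
}

// Run-time dispatch on the vector length. Both dotImpl<int> and
// dotImpl<std::size_t> are compiled, even if the caller's LocalOrdinal
// already guarantees that the length fits in int.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
  assert(x.size() == y.size());
  if (x.size() < static_cast<std::size_t>(INT_MAX)) {
    return dotImpl<int>(x, y);
  }
  return dotImpl<std::size_t>(x, y);
}
```

With LocalOrdinal = int, the dotImpl<std::size_t> branch can never be taken, yet it still costs compile time and object code for every kernel and scalar type.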
I wrote "single-vector kernels" in the title, because kernels that address entries of a contiguously stored MultiVector using (i,j) indexing may need a larger SizeType than LocalOrdinal. For example, consider a 200-column MultiVector (which we may reasonably encounter in iterative solvers like GMRES) with 10^7 rows. Its 2 x 10^9 entries are already within 7% of INT_MAX (about 2.147 x 10^9), so SizeType = int will overflow with only slightly more rows or columns, even though the row count itself fits comfortably in int.
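To make the arithmetic concrete, here is a small sketch. The helper names are hypothetical, and it assumes a column-major block whose stride equals the number of rows, and a 32-bit int. With 10^7 rows, the largest flat index for 200 columns is 1,999,999,999, which sits just under INT_MAX = 2,147,483,647; from 215 columns on, it no longer fits.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Largest flat index touched when addressing entry (i,j) of a contiguous
// column-major block as i + j * numRows. The maximum is at the last entry.
inline std::int64_t maxFlatIndex(std::int64_t numRows, std::int64_t numCols) {
  assert(numRows > 0 && numCols > 0);
  return (numRows - 1) + (numCols - 1) * numRows;
}

// True if the value can be represented by int on this platform.
inline bool fitsInInt(std::int64_t value) {
  return value <= static_cast<std::int64_t>(std::numeric_limits<int>::max());
}
```

For example, fitsInInt(maxFlatIndex(10000000, 200)) holds, while fitsInInt(maxFlatIndex(10000000, 215)) does not, even though the row count alone is far below INT_MAX in both cases.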
If people want to use KokkosKernels stand-alone, it may make sense for it to dispatch to different instantiations at run time, as a performance optimization. However, I think Tpetra at least needs the ability to turn off this feature at configure or compile time.
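One possible shape for such a switch, as a sketch only: a compile-time flag that selects between run-time dispatch and a single instantiation on LocalOrdinal. The names Nrm1, nrm1Impl, and DispatchAtRunTime are invented for this example and are not part of KokkosKernels; a configure-time macro could set the flag.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical 1-norm kernel, templated on the loop index type.
template <class SizeType>
double nrm1Impl(const double* x, SizeType n) {
  double sum = 0.0;
  for (SizeType i = 0; i < n; ++i) {
    sum += (x[i] < 0.0) ? -x[i] : x[i];
  }
  return sum;
}

// Primary template (DispatchAtRunTime = false): always index with
// LocalOrdinal, so only one nrm1Impl instantiation is generated.
template <class LocalOrdinal, bool DispatchAtRunTime>
struct Nrm1 {
  static double run(const double* x, LocalOrdinal n) {
    return nrm1Impl<LocalOrdinal>(x, n);
  }
};

// Specialization (DispatchAtRunTime = true): compare the length to INT_MAX
// at run time and pick int or std::size_t, which instantiates both.
template <class LocalOrdinal>
struct Nrm1<LocalOrdinal, true> {
  static double run(const double* x, LocalOrdinal n) {
    assert(n >= 0);  // lengths are assumed nonnegative
    if (static_cast<unsigned long long>(n) <
        static_cast<unsigned long long>(INT_MAX)) {
      return nrm1Impl<int>(x, static_cast<int>(n));
    }
    return nrm1Impl<std::size_t>(x, static_cast<std::size_t>(n));
  }
};
```

Tpetra could then call Nrm1<LocalOrdinal, false>::run, while stand-alone users keep the run-time dispatch by passing true.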