KokkosKernels: Add simplified kernels for integer Scalar types
Created by: mhoemmen
@trilinos/zoltan2 @trilinos/tpetra
Key words: build size, build time
Many Trilinos users have complained about long build times and large library and executable sizes. One of the biggest contributors in the Tpetra solver stack is the large number of template parameter combinations for which Tpetra classes get instantiated. For example, we build all of Tpetra for Scalar = int and Scalar = GlobalOrdinal for EVERY enabled GlobalOrdinal type, as well as for the usual Scalar types like double and std::complex.
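The blow-up is easy to see with a toy explicit template instantiation (ETI) sketch. The class and the instantiation list below are hypothetical illustrations, not Tpetra's actual ETI macros:

```cpp
#include <complex>

// Toy stand-in for a Tpetra-like class templated on Scalar,
// LocalOrdinal (LO), and GlobalOrdinal (GO).  Name is hypothetical.
template <class Scalar, class LO, class GO>
class ToyMultiVector {
public:
  Scalar dummy() const { return Scalar(); }
};

// Each explicitly instantiated (Scalar, LO, GO) triple adds compiled
// code to the library.  With the default LO = int and GO in
// {int, long long}, every additional Scalar type -- including the
// integer Scalars used mostly for Import/Export -- multiplies again
// across the GO list.
template class ToyMultiVector<double, int, int>;
template class ToyMultiVector<double, int, long long>;
template class ToyMultiVector<std::complex<double>, int, int>;
template class ToyMultiVector<std::complex<double>, int, long long>;
template class ToyMultiVector<int, int, int>;             // Scalar = int
template class ToyMultiVector<int, int, long long>;
template class ToyMultiVector<long long, int, long long>; // Scalar = GO
```

Seven instantiations for one toy class; the real Tpetra stack repeats this pattern across many classes and kernels, which is where the library sizes below come from.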
Use of integer Scalar types seems a little weird. In most cases where Tpetra or downstream Trilinos packages use integer Scalar types, they use them for communication (as the source or target of an Export or Import), not for computation. This could justify refactoring Tpetra's class hierarchy into integer and non-integer "branches." However, I had a conversation with Michael Wolf about Zoltan2's needs. He explained that Zoltan2 does sparse matrix-vector multiplies with integer Scalar types when computing some of its metrics. This means that we really do need to compute with integer Scalar types. However, as far as I know, we don't need highly optimized kernels for them.
This suggests that we could address the problem at the KokkosKernels level by falling back to simple kernels for integer Scalar types. This issue proposes to do just that. The kernels still need to be thread parallel, and must use CUDA appropriately, but they don't need such heavy optimization. We could use simple one-level parallelism, for example.
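As a rough sketch of what such a fallback could look like, here is a plain CSR sparse matrix-vector multiply with a flat loop over rows. The function name and interface are hypothetical; in KokkosKernels the outer loop would become a single Kokkos::parallel_for over rows (one-level parallelism), with no hierarchical or vector-level optimizations:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical fallback kernel: y = A*x for a CSR matrix, intended for
// integer Scalar types where performance is not critical.  Each row is
// independent, so the outer loop maps directly onto a flat
// Kokkos::parallel_for (one level of parallelism, CUDA-friendly).
template <class Scalar, class Ordinal>
void simple_spmv(const std::vector<std::size_t>& rowPtr,
                 const std::vector<Ordinal>& colInd,
                 const std::vector<Scalar>& values,
                 const std::vector<Scalar>& x,
                 std::vector<Scalar>& y)
{
  const std::size_t numRows = rowPtr.size() - 1;
  for (std::size_t row = 0; row < numRows; ++row) {
    Scalar sum = 0;
    for (std::size_t k = rowPtr[row]; k < rowPtr[row + 1]; ++k) {
      sum += values[k] * x[colInd[k]];
    }
    y[row] = sum;
  }
}
```

A kernel like this would be correct for Zoltan2's metric computations without adding much compiled code per integer Scalar instantiation.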
Here are some build directory size statistics for the Trilinos/packages/tpetra build directory after `make clean` and `make`, with no examples or tests enabled. I used GCC 4.7.2 on Linux, and enabled Scalar = std::complex. Otherwise, I used default settings for enabled types. (The default enabled LocalOrdinal type is int; the default enabled GlobalOrdinal types are int and long long.) Builds whose names end in _STATIC use static libraries; otherwise, I use dynamic libraries. _DEBUG builds have Kokkos and Teuchos debugging features (e.g., bounds checking) turned on; _RELEASE builds have these debugging features turned off. I enabled only the Kokkos::OpenMP version of Tpetra (this should generate more code than the Kokkos::Serial version).
- MPI_DEBUG: 2.1 G
- MPI_DEBUG_STATIC: 11 G
- MPI_RELEASE: 187 M
- MPI_RELEASE_STATIC: 2.3 G
Do you see why we recommend dynamic libraries? ;-)
Correction: My MPI_RELEASE build is Kokkos::Serial only.