Trilinos issues (https://gitlab.osti.gov/jmwille/Trilinos/-/issues)
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/700
KokkosKernels: Add simplified kernels for integer Scalar types (2016-11-04T05:30:13Z, James Willenbring)
*Created by: mhoemmen*
@trilinos/zoltan2 @trilinos/tpetra
Key words: build size, build time
Many Trilinos users have complained about long build times and large library and executable sizes. One of the biggest sources of this in the Tpetra solver stack is the large number of template parameter combinations for which Tpetra classes get instantiated. For example, we build all of Tpetra for Scalar = int and Scalar = GlobalOrdinal for EVERY enabled GlobalOrdinal type, as well as for the usual Scalar types like double and std::complex<double>.
Use of integer Scalar types seems a little weird. In most cases where Tpetra or downstream Trilinos packages use integer Scalar types, they use them for communication (as the source or target of an Export or Import), not for computation. This could justify refactoring Tpetra's class hierarchy into integer and non-integer "branches." However, I had a conversation with Michael Wolf about Zoltan2's needs. He explained that for some computations of metrics, Zoltan2 does sparse matrix-vector multiplies with integer Scalar types. This means that we really do need to compute with integer Scalar types. However, we don't need highly optimized kernels for integer Scalar types, as far as I know.
This suggests that we could address the problem at the KokkosKernels level, by falling back to simple kernels for integer Scalar types. This issue proposes to do just that. The kernels still need to be thread parallel, and must use CUDA appropriately. However, they don't need such heavy optimization. We can write simple one-level parallelism, for example.
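As a minimal sketch of what such a simple kernel could look like, here is a one-level parallel_for over the rows of a CRS matrix; the function name and interface are illustrative, not the actual KokkosKernels API.

```cpp
// Sketch only: a one-level thread-parallel CRS mat-vec suitable for an
// integer Scalar type.  No team-level parallelism, no vectorization.
#include <Kokkos_Core.hpp>

template <class Scalar, class Ordinal, class Offset>
void simple_spmv(const Kokkos::View<const Offset*>&  rowptr,  // length numRows+1
                 const Kokkos::View<const Ordinal*>& colind,  // length numEntries
                 const Kokkos::View<const Scalar*>&  values,  // length numEntries
                 const Kokkos::View<const Scalar*>&  x,
                 const Kokkos::View<Scalar*>&        y)
{
  const Ordinal numRows = static_cast<Ordinal>(rowptr.extent(0)) - 1;
  // One-level parallelism: one parallel iteration per row.
  Kokkos::parallel_for("simple_spmv", numRows, KOKKOS_LAMBDA(const Ordinal row) {
    Scalar sum = 0;
    for (Offset k = rowptr(row); k < rowptr(row + 1); ++k) {
      sum += values(k) * x(colind(k));
    }
    y(row) = sum;
  });
}
```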
Here are some build directory size statistics for the Trilinos/packages/tpetra build directory after `make clean` and `make`, with no examples or tests enabled. I used GCC 4.7.2 on Linux, and enabled Scalar = std::complex<double>. Otherwise, I only use default settings for enabled types. (Default enabled LocalOrdinal type is int. Default enabled GlobalOrdinal types are int and long long.) `_STATIC` builds use static libraries; otherwise, I use dynamic libraries. `_DEBUG` builds have Kokkos and Teuchos debugging features (e.g., bounds checking) turned on; `_RELEASE` builds have these debugging features turned off. I enabled only the Kokkos::OpenMP version of Tpetra (this should generate more code than the Kokkos::Serial version).
- MPI_DEBUG: 2.1 G
- MPI_DEBUG_STATIC: 11 G
- MPI_RELEASE: 187 M
- MPI_RELEASE_STATIC: 2.3 G
Do you see why we recommend dynamic libraries? ;-)
Correction: My MPI_RELEASE build is Kokkos::Serial only.
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/690
AllowPadding and WithoutInitializing moving to Kokkos namespace; change downstream Trilinos code accordingly (2017-10-27T04:10:00Z, James Willenbring)
*Created by: mhoemmen*
Kokkos/develop has moved AllowPadding and WithoutInitializing out of the Kokkos::Experimental namespace, into the Kokkos namespace. Once this gets moved into Kokkos/master and snapshotted into Trilinos, change Trilinos downstream code accordingly.
https://github.com/kokkos/kokkos/issues/325
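As an illustration of the downstream change, here is a minimal before/after sketch of a typical view_alloc call site; the function name and labels are made up, but the allocation properties are the ones this issue discusses.

```cpp
#include <Kokkos_Core.hpp>

// Illustrative call sites only (run between Kokkos::initialize and
// Kokkos::finalize); the real downstream code lives in the packages tagged above.
void allocate_views() {
  using view_type = Kokkos::View<double*>;

  // Before the snapshot: the properties lived in Kokkos::Experimental, e.g.
  //   view_type a(Kokkos::view_alloc(Kokkos::Experimental::WithoutInitializing, "a"), 100);

  // After the snapshot: use the Kokkos namespace directly.
  view_type a(Kokkos::view_alloc(Kokkos::WithoutInitializing, "a"), 100);
  view_type b(Kokkos::view_alloc(Kokkos::AllowPadding, "b"), 100);
}
```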
@trilinos/tpetra
@trilinos/stokhos
@trilinos/sacado
@trilinos/shylu
@trilinos/stk
Tpetra-backlog
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/691
KokkosKernels: One kernel instantiation for both CudaSpace and CudaUVMSpace (2016-10-05T19:11:59Z, James Willenbring)
*Created by: mhoemmen*
This depends on https://github.com/kokkos/kokkos/issues/290. (See also https://github.com/kokkos/kokkos/issues/290, marked as redundant.)
Once the above Kokkos issue is fixed, we'll be able to assign from a CudaUVMSpace View to a CudaSpace View. This will let us just instantiate kernels for CudaSpace.
@trilinos/tpetra
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/667
KokkosKernels: Use existing macros for TPLs like MKL (2017-09-01T19:25:43Z, James Willenbring)
*Created by: mhoemmen*
@trilinos/tpetra
I'm looking in tpetra/kernels/src/stage/graph/impl/KokkosKernels_SPGEMM_mkl_impl.hpp. I notice a macro, KERNELS_HAVE_MKL, that gets defined by hand somewhere in the code; it is not plugged into the CMake build system at all. KokkosKernels already has a macro, HAVE_TPETRAKERNELS_MKL. Please use that instead. "KERNELS_HAVE_MKL" is too general a name; it is likely to collide with another software library's macros.
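The requested change amounts to guarding the MKL-specific code with the CMake-generated macro instead of the hand-defined one. A minimal sketch (not the file's actual contents):

```cpp
// Sketch of the requested guard change in KokkosKernels_SPGEMM_mkl_impl.hpp.

// Before: hand-defined macro, unknown to the CMake build system.
// #ifdef KERNELS_HAVE_MKL
//   // ... MKL-specific SpGEMM code ...
// #endif

// After: the macro KokkosKernels' CMake configuration already generates.
#ifdef HAVE_TPETRAKERNELS_MKL
  // ... MKL-specific SpGEMM code ...
#endif
```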
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/666
KokkosKernels: Fix sparse matrix-matrix multiply error handling (2016-11-02T17:34:54Z, James Willenbring)
*Created by: mhoemmen*
@trilinos/tpetra
For example, if the user requests a TPL (such as CUSP), but that TPL is not installed, the implementation should throw an exception or return an error code, not just print some error message to stderr. I see a similar approach to error handling when the memory space is wrong (e.g., lines 67-80 of KokkosKernels_SPGEMM_mkl_impl.hpp).
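A minimal sketch of the kind of error handling this asks for; the macro name, function, and message below are illustrative placeholders, not the actual KokkosKernels code.

```cpp
#include <sstream>
#include <stdexcept>

// Illustrative only: fail loudly when a requested TPL is not enabled,
// instead of printing to stderr and continuing.
void spgemm_cusp_check(const bool cuspRequested) {
#ifdef HAVE_TPETRAKERNELS_CUSP  // hypothetical macro name, for illustration
  (void) cuspRequested;  // the CUSP path would be dispatched elsewhere
#else
  if (cuspRequested) {
    std::ostringstream os;
    os << "Sparse matrix-matrix multiply: the CUSP TPL was requested, "
          "but Trilinos was not built with CUSP enabled.";
    throw std::runtime_error(os.str());
  }
#endif
}
```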
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/662
KokkosKernels: Add segmented sort / sort-and-merge (2016-10-26T04:10:08Z, James Willenbring)
*Created by: mhoemmen*
See #660 for a use case. Tpetra::Crs{Graph,Matrix}::fillComplete currently needs segmented sort-and-merge, though a fix for #119 would remove the "-and-merge" requirement.
Thrust doesn't have anything like this. stable_sort_by_key() just does what Tpetra::sort2 currently does: it applies the implicit permutation that results from sorting the keys to a corresponding array of values.
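For reference, here is a minimal serial sketch of the per-segment sort-and-merge step (sort column indices, apply the same permutation to the values, then sum duplicates); a segmented kernel would do this in parallel across rows. The function is illustrative, not existing Tpetra or KokkosKernels code.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Serial sketch of per-segment sort-and-merge.  'cols' and 'vals' hold one
// row's column indices and values.
template <class Ordinal, class Scalar>
void sort_and_merge_row(std::vector<Ordinal>& cols, std::vector<Scalar>& vals) {
  // Sort by key (column index) and apply the same permutation to the values,
  // which is what Tpetra::sort2 does today.
  std::vector<std::size_t> perm(cols.size());
  std::iota(perm.begin(), perm.end(), std::size_t(0));
  std::sort(perm.begin(), perm.end(),
            [&](std::size_t a, std::size_t b) { return cols[a] < cols[b]; });

  std::vector<Ordinal> sortedCols(cols.size());
  std::vector<Scalar>  sortedVals(vals.size());
  for (std::size_t k = 0; k < perm.size(); ++k) {
    sortedCols[k] = cols[perm[k]];
    sortedVals[k] = vals[perm[k]];
  }

  // Merge duplicates: sum values that share a column index.
  cols.clear();
  vals.clear();
  for (std::size_t k = 0; k < sortedCols.size(); ++k) {
    if (!cols.empty() && cols.back() == sortedCols[k]) {
      vals.back() += sortedVals[k];
    } else {
      cols.push_back(sortedCols[k]);
      vals.push_back(sortedVals[k]);
    }
  }
}
```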
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/568
Tpetra: proposed changes to handle processes with zero rows (2016-09-14T17:19:30Z, James Willenbring)
*Created by: allaffa*
@allaffa @jhux2 @trilinos/tpetra @rstumin
I am currently working on a project at Sandia on multigrid preconditioners in which some MPI tasks own no matrix rows. This has revealed two places where KokkosKernels and Tpetra throw an exception.
I would like to propose changing the conditions under which these exceptions are thrown, so that they also check that the number of rows is nonzero.
See attached .txt files.
[0005-Tpetra-added-check-on-rows-in-exception.txt](https://github.com/trilinos/Trilinos/files/425981/0005-Tpetra-added-check-on-rows-in-exception.txt)
[0007-Kokkos-added-check-on-rows-in-exception.txt](https://github.com/trilinos/Trilinos/files/425980/0007-Kokkos-added-check-on-rows-in-exception.txt)
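The attached patches contain the actual changes. Purely as an illustration of the idea (the names and conditions below are placeholders, not the patched code), the proposal adds a nonzero-row-count check to the existing exception test:

```cpp
#include <stdexcept>

// Generic illustration of the proposed change; the real conditions live in
// the attached patch files.
void check_local_matrix(const int numLocalRows, const bool otherErrorCondition) {
  // Before (illustrative): throw whenever the error condition holds.
  //   if (otherErrorCondition) { throw std::logic_error("..."); }
  //
  // After: a process that owns zero rows legitimately trips the condition,
  // so only throw when the process actually has rows.
  if (numLocalRows != 0 && otherErrorCondition) {
    throw std::logic_error("Invalid local matrix on a process that owns rows.");
  }
}
```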
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/583
Ifpack2: Plug in HTS for thread-parallel sparse triangular solve (2017-09-28T01:43:23Z, James Willenbring)
*Created by: mhoemmen*
@trilinos/ifpack2 @trilinos/tpetra
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/447
KokkosKernels: CrsMatrix sumIntoValuesSorted minor questions (2016-06-19T20:49:20Z, James Willenbring)
*Created by: mhoemmen*
@bathmatt 's commit https://github.com/trilinos/Trilinos/commit/1e65cffac05ae95bbc4e6f0d6d4d428886704834 added a sumIntoValuesSorted method to KokkosSparse::CrsMatrix. I have a few comments and questions:
1. We should introduce the "hint" that Epetra and Tpetra use to optimize searching for multiple column indices. It adds an extra branch per input index, but avoids the search in common cases. @etphipp first implemented it in Tpetra and found it useful, and `findRelOffset` (in tpetra/core/src/Tpetra_Util.hpp) does it too. (See the sketch after this list.)
2. It's legit to use `ordinal_type` (32-bit) instead of `size_type` (64-bit on everything but CUDA) for the difference between two consecutive row offsets, as long as the row doesn't have too many duplicate entries. SparseRowView(Const) already uses `ordinal_type` for the row length, for this reason.
3. Was there a particular reason for the `hi - low > 10` cut-off, or is that just a good guess?
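To make item 1 concrete, here is a minimal sketch of the hint idea, in the spirit of findRelOffset but not its actual code: check the caller-supplied hint position first, and only fall back to a binary search over the sorted row when the hint misses.

```cpp
#include <cstddef>

// Illustrative sketch of the "hint" optimization for locating a column index
// in a sorted row.  'hint' is the caller's guess for where 'colToFind' lives;
// for consecutive input indices the guess is usually right, so the search is skipped.
template <class Ordinal>
std::size_t find_offset_with_hint(const Ordinal* rowCols, const std::size_t rowLen,
                                  const Ordinal colToFind, const std::size_t hint) {
  // Fast path: one extra branch per input index, but no search in the common case.
  if (hint < rowLen && rowCols[hint] == colToFind) {
    return hint;
  }
  // Slow path: binary search over the sorted row.
  std::size_t lo = 0, hi = rowLen;
  while (lo < hi) {
    const std::size_t mid = lo + (hi - lo) / 2;
    if (rowCols[mid] < colToFind) { lo = mid + 1; } else { hi = mid; }
  }
  // Return rowLen if not found; the caller treats that as "not present."
  return (lo < rowLen && rowCols[lo] == colToFind) ? lo : rowLen;
}
```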
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/416
KokkosKernels: Optimize BCRS mat-vec for GPUs and CPU SIMD (2017-10-26T20:00:31Z, James Willenbring)
*Created by: mhoemmen*
@trilinos/tpetra
Our fix for #178 includes only a one-level thread parallelization of BlockCrsMatrix (BCRS) matrix-vector multiply. It runs on all supported platforms (including GPUs), but needs further optimization, both for GPUs and for non-GPU vectorization. At least, we need to plug in vendor libraries, if they are well optimized (esp. for block sizes that are not powers of two).
What does "well optimized" mean? The lower bar should be "matches performance of mat-vec with the same sparse matrix, stored in (non-block) CRS format." That is, we compare against Christian's hand-optimized (non-block) CRS kernel, by taking the block matrix and storing it in non-block format. When performance matches, we have achieved the lower bar.
This is a lower bar because it inflates total per-MPI-process memory usage of the matrix by `M_crs / M_bcrs`, where
`M_crs = Z * (sizeof(LO)*b^2 + sizeof(Scalar)*b^2) + (m-1)*sizeof(offset_type)`
and
`M_bcrs = Z * (sizeof(LO) + sizeof(Scalar)*b^2) + (m-1)*sizeof(offset_type)`.
In these formulae, `Z` is the number of matrix entries on that process (counting each entry in a block as separate), `m` is the number of rows on that process, and `LO` ("local ordinal") is the type of each column index. Matrix-vector multiply must read all the matrix's data once, so this is a bandwidth-based lower bound (for a kernel that should be bandwidth bound when the problem does not fit in cache).
This is a _reasonable_ lower bar because, for typical `Scalar = double` and `LO = int`, `M_crs / M_bcrs < 2`. This also suggests a stopping criterion for optimization. It may be possible for BCRS mat-vec to get a higher fraction of memory bandwidth peak than CRS mat-vec, because it involves more contiguous, direct memory loads. Thus, I am open to suggestions for a better stopping criterion.
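As a quick numeric sanity check of the formulae above (Scalar = double, LO = int, and a 64-bit offset_type; the values of Z and m are made-up examples), evaluating the ratio for a few block sizes shows it staying below 2:

```cpp
#include <cstddef>
#include <cstdio>

// Evaluate M_crs / M_bcrs from the formulae above for a few block sizes b.
// Z (entries) and m (local rows) are example values; only the ratio matters.
int main() {
  using LO = int;
  using Scalar = double;
  using offset_type = std::size_t;

  const double Z = 1.0e6;   // matrix entries on this process (example value)
  const double m = 1.0e4;   // rows on this process (example value)

  for (int b = 1; b <= 8; ++b) {
    const double b2 = double(b) * double(b);
    const double M_crs  = Z * (sizeof(LO) * b2 + sizeof(Scalar) * b2) + (m - 1) * sizeof(offset_type);
    const double M_bcrs = Z * (sizeof(LO)      + sizeof(Scalar) * b2) + (m - 1) * sizeof(offset_type);
    std::printf("b = %d: M_crs / M_bcrs = %.3f\n", b, M_crs / M_bcrs);
  }
  return 0;
}
```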
Tpetra-backlog
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/340
KokkosKernels: Fix ETI (2017-07-31T23:45:39Z, James Willenbring)
*Created by: mhoemmen*
@trilinos/tpetra
We need to fix ETI for KokkosKernels, in order to keep build and link times down for Tpetra and downstream packages. This is especially important for link-time optimization (LTO).
A fix would be to use the same .cpp file generation approach used in TpetraCore and Ifpack2.
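For context, the TpetraCore/Ifpack2 approach generates many small .cpp files, each containing one explicit template instantiation (ETI), so downstream packages link against prebuilt instantiations instead of re-instantiating the templates themselves. A rough sketch of what one generated file looks like; the kernel name and signature are illustrative, not KokkosKernels' actual API.

```cpp
// Sketch of a generated ETI .cpp file, one per enabled
// (Scalar, LocalOrdinal, ExecutionSpace) combination.
#include <Kokkos_Core.hpp>

namespace Example {

template <class Scalar, class Ordinal, class ExecSpace>
struct SpmvKernel {
  static void run() { /* kernel body elided in this sketch */ }
};

// The build system would emit one such line per enabled type combination.
template struct SpmvKernel<double, int, Kokkos::DefaultExecutionSpace>;

} // namespace Example
```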
Tpetra-backlog
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/288
Ifpack2: Build and test new threaded coloring Gauss-Seidel by default (2017-01-06T20:29:55Z, James Willenbring)
*Created by: mhoemmen*
@trilinos/ifpack2 @kddevin @jdbooth
Right now, enabling Ifpack2's new threaded coloring Gauss-Seidel requires enabling "experimental" CMake options (both off by default) in both KokkosKernels and Ifpack2. It would make sense to build this new capability and make it available by default, even if it is not the default Gauss-Seidel implementation.
Tpetra-FY17-Q2
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/226
cuda build of Kokkos_Sparse_MV_impl_spmv takes 25 minutes for 1 file. (2016-05-18T19:44:14Z, James Willenbring)
*Created by: bathmatt*
Can we split up the build of this file? This is all on hansen with
-bash-4.1$ module load devpack/openmpi/1.10.0/gcc/4.8.4/cuda/7.5.18
The build of
[ 66%] Building CXX object packages/tpetra/kernels/src/CMakeFiles/tpetrakernels.dir/impl/Kokkos_Sparse_MV_impl_spmv_Cuda.cpp.o
[ 66%] Building CXX object packages/tpetra/kernels/src/CMakeFiles/tpetrakernels.dir/impl/Kokkos_Sparse_MV_impl_spmv_Serial.cpp.o
is very slow under cuda/debug (not with -G, though). By "very slow," I mean 25 minutes.
-bash-4.1$ time make -j
[ 0%] Built target kokkoscore
[ 0%] Built target kokkosalgorithms
[ 0%] Built target kokkoscontainers
[ 33%] Built target teuchoscore
[ 66%] Built target teuchosparameterlist
[ 66%] Built target teuchoscomm
[ 66%] Building CXX object packages/tpetra/kernels/src/CMakeFiles/tpetrakernels.dir/impl/Kokkos_Sparse_MV_impl_spmv_Cuda.cpp.o
[ 66%] Building CXX object packages/tpetra/kernels/src/CMakeFiles/tpetrakernels.dir/impl/Kokkos_Sparse_MV_impl_spmv_Serial.cpp.o
[ 66%] Linking CXX static library libtpetrakernels.a
[100%] Built target tpetrakernels
real 24m16.402s
user 29m35.614s
sys 0m58.005s
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/208
build of Kokkos_Sparse_MV_impl_spmv_Serial.cpp.o fails if you use nvcc and have cuda disabled (2016-03-19T08:19:49Z, James Willenbring)
*Created by: bathmatt*
If I don't configure with CUDA but still have OMPI_CXX=nvcc_wrapper, I get the following. This is on hansen.
[ 66%] Building CXX object packages/tpetra/kernels/src/CMakeFiles/tpetrakernels.dir/impl/Kokkos_Sparse_MV_impl_spmv_Serial.cpp.o
/home/mbetten/Trilinos/Trilinos/packages/tpetra/kernels/src/impl/Kokkos_Sparse_impl_spmv.hpp(885): error: namespace "Kokkos" has no member "shfl_down"
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/178
Tpetra BCRS: Thread-parallelize sparse matrix-vector multiply (2016-06-02T15:57:50Z, James Willenbring)
*Created by: mhoemmen*
@trilinos/tpetra @trilinos/ifpack2 @crtrott @kyungjoo-kim @amklinv
Thread-parallelize the sparse matrix-vector multiply in the apply() method of Tpetra::Experimental::BlockCrsMatrix. Please interact with Ryan Eberhardt, who has an excellent CUDA implementation for column-major blocks.
It would be wise to do this in two passes. First, add a simple host execution space parallelization using a lambda. Then, implement an optimized kernel, using Ryan's as a start.
This affects Ifpack2 as well as Tpetra, because for Jacobi with > 1 sweep, Ifpack2 uses sparse mat-vec.
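A minimal sketch of the suggested first pass: a one-level parallel_for over block rows using a lambda. The data layout (row-major blocks in a flat value array) and names are illustrative, not Tpetra's actual BlockCrsMatrix storage, and the optimized second pass would replace this.

```cpp
#include <Kokkos_Core.hpp>

// Sketch of a first-pass, one-level thread parallelization of BCRS mat-vec:
// one parallel iteration per block row, no nested team or vector parallelism.
template <class Scalar, class Ordinal, class Offset>
void bcrs_apply_simple(const Ordinal blockSize,
                       const Kokkos::View<const Offset*>&  blockRowPtr, // numBlockRows+1
                       const Kokkos::View<const Ordinal*>& blockColInd, // numBlocks
                       const Kokkos::View<const Scalar*>&  blockValues, // numBlocks*blockSize^2
                       const Kokkos::View<const Scalar*>&  x,
                       const Kokkos::View<Scalar*>&        y)
{
  const Ordinal numBlockRows = static_cast<Ordinal>(blockRowPtr.extent(0)) - 1;
  const Ordinal bs2 = blockSize * blockSize;

  Kokkos::parallel_for("bcrs_apply_simple", numBlockRows,
    KOKKOS_LAMBDA(const Ordinal brow) {
      for (Ordinal i = 0; i < blockSize; ++i) {
        Scalar sum = 0;
        for (Offset k = blockRowPtr(brow); k < blockRowPtr(brow + 1); ++k) {
          const Ordinal bcol = blockColInd(k);
          const Scalar* blk = &blockValues(k * bs2);  // illustrative row-major block
          for (Ordinal j = 0; j < blockSize; ++j) {
            sum += blk[i * blockSize + j] * x(bcol * blockSize + j);
          }
        }
        y(brow * blockSize + i) = sum;
      }
    });
}
```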