Trilinos issues https://gitlab.osti.gov/jmwille/Trilinos/-/issues 2017-10-26T20:00:31Z
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/416KokkosKernels: Optimize BCRS mat-vec for GPUs and CPU SIMD2017-10-26T20:00:31ZJames WillenbringKokkosKernels: Optimize BCRS mat-vec for GPUs and CPU SIMD*Created by: mhoemmen*
@trilinos/tpetra
Our fix for #178 includes only a one-level thread parallelization of BlockCrsMatrix (BCRS) matrix-vector multiply. It runs on all supported platforms (including GPUs), but needs further optimization, both for GPUs and for non-GPU vectorization. At least, we need to plug in vendor libraries, if they are well optimized (esp. for block sizes that are not powers of two).
What does "well optimized" mean? The lower bar should be "matches performance of mat-vec with the same sparse matrix, stored in (non-block) CRS format." That is, we compare against Christian's hand-optimized (non-block) CRS kernel, by taking the block matrix and storing it in non-block format. When performance matches, we have achieved the lower bar.
This is a lower bar because it inflates total per-MPI-process memory usage of the matrix by `M_crs / M_bcrs`, where
`M_crs = Z * (sizeof(LO)*b^2 + sizeof(Scalar)*b^2) + (m-1)*sizeof(offset_type)`
and
`M_bcrs = Z * (sizeof(LO) + sizeof(Scalar)*b^2) + (m-1)*sizeof(offset_type)`.
In these formulae, `Z` is the number of matrix entries on that process (counting each entry in a block as separate), `b` is the block size (each block is `b` by `b`), `m` is the number of rows on that process, and `LO` ("local ordinal") is the type of each column index. Matrix-vector multiply must read all the matrix's data once, so this is a bandwidth-based lower bound (for a kernel that should be bandwidth bound when the problem does not fit in cache).
This is a _reasonable_ lower bar because, for typical `Scalar = double` and `LO = int`, `M_crs / M_bcrs < 2`. This also suggests a stopping criterion for optimization. It may be possible for BCRS mat-vec to get a higher fraction of memory bandwidth peak than CRS mat-vec, because it involves more contiguous, direct memory loads. Thus, I am open to suggestions for a better stopping criterion.
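As a rough check of the `M_crs / M_bcrs < 2` claim, the two formulas can be evaluated directly. The sketch below is illustrative only: the function name and the sample sizes are made up, `offset_type` is assumed to be 8 bytes, and `Z` is treated as the count of stored blocks, since the formulas as written multiply `Z` by per-block sizes.

```python
def memory_ratio(Z, m, b, size_lo=4, size_scalar=8, size_offset=8):
    """Evaluate M_crs / M_bcrs from the two formulas in the issue text.

    Assumptions (not from the issue): Z is read as the number of stored
    b-by-b blocks, and offset_type is 8 bytes. The defaults correspond to
    LO = int (4 bytes) and Scalar = double (8 bytes).
    """
    offsets = (m - 1) * size_offset
    m_crs = Z * (size_lo * b**2 + size_scalar * b**2) + offsets
    m_bcrs = Z * (size_lo + size_scalar * b**2) + offsets
    return m_crs / m_bcrs


if __name__ == "__main__":
    # Hypothetical sizes: 1000 stored blocks, 100 rows.
    for b in (1, 2, 3, 5, 8):
        print(b, round(memory_ratio(Z=1000, m=100, b=b), 3))
```

With these type sizes the ratio approaches (4 + 8) / 8 = 1.5 as `b` grows, which is consistent with the bound of 2 stated above.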
Tpetra-backlog
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/662KokkosKernels: Add segmented sort / sort-and-merge2016-10-26T04:10:08ZJames WillenbringKokkosKernels: Add segmented sort / sort-and-merge*Created by: mhoemmen*
See #660 for a use case. Tpetra::Crs{Graph,Matrix}::fillComplete currently needs segmented sort-and-merge, though a fix for #119 would remove the "-and-merge" requirement.
Thrust doesn't have anything like this. stable_sort_by_key() just does what Tpetra::sort2 currently does, namely apply the implicit permutation resulting from sorting keys, to a corresponding array of values.
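For readers unfamiliar with the term, the following is a minimal serial sketch of segmented sort-and-merge on CRS arrays: each row's column indices are sorted independently, and duplicate column indices within a row are merged. It illustrates only the semantics; it is not the Tpetra or KokkosKernels implementation. The function and argument names are hypothetical, and merging by summing values is an assumption about the merge rule.

```python
def segmented_sort_and_merge(row_ptr, col_ind, values):
    """Sort each CRS row by column index and merge duplicate columns.

    row_ptr has one entry per row plus one; row i owns the half-open
    range [row_ptr[i], row_ptr[i + 1]) of col_ind and values.
    Duplicates are merged by summing (an assumed merge rule).
    """
    new_ptr, new_cols, new_vals = [0], [], []
    for row in range(len(row_ptr) - 1):
        start, end = row_ptr[row], row_ptr[row + 1]
        # Sort only this segment; sorted() is stable, so equal keys keep order.
        pairs = sorted(zip(col_ind[start:end], values[start:end]),
                       key=lambda pair: pair[0])
        for col, val in pairs:
            row_has_entries = len(new_cols) > new_ptr[-1]
            if row_has_entries and new_cols[-1] == col:
                new_vals[-1] += val  # merge duplicate within the same row
            else:
                new_cols.append(col)
                new_vals.append(val)
        new_ptr.append(len(new_cols))
    return new_ptr, new_cols, new_vals


if __name__ == "__main__":
    print(segmented_sort_and_merge([0, 3, 5], [2, 0, 2, 1, 1],
                                   [1.0, 2.0, 3.0, 4.0, 5.0]))
```

Dropping the merge branch gives a plain segmented sort, which is all that would remain necessary if the "-and-merge" requirement went away.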
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/691KokkosKernels: One kernel instantiation for both CudaSpace and CudaUVMSpace2016-10-05T19:11:59ZJames WillenbringKokkosKernels: One kernel instantiation for both CudaSpace and CudaUVMSpace*Created by: mhoemmen*
This depends on https://github.com/kokkos/kokkos/issues/290. (See also https://github.com/kokkos/kokkos/issues/290 , marked as redundant.)
Once the above Kokkos issue is fixed, we'll be able to assign from a CudaUVMSpace View to a CudaSpace View. This will let us just instantiate kernels for CudaSpace.
@trilinos/tpetra
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/948KokkosKernels: SPMV performance comparison test2017-01-06T00:12:10ZJames WillenbringKokkosKernels: SPMV performance comparison test*Created by: crtrott*
This adds a performance comparison test without MPI for just the SPMV kernel. It can compare the KokkosKernels function, a custom Kokkos implementation, cuSPARSE, and MKL.
This can read MatrixMarket format, and also generate a binary matrix storage file, which it can read again. Depending on your system, reading the binary format back in can be 10x faster than reading the text file.
I wrote this quite a while ago; I have just cleaned it up now.
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/964Kokkos::StaticCrsGraph does not use memory space template parameter passed to...2017-01-10T20:54:12ZJames WillenbringKokkos::StaticCrsGraph does not use memory space template parameter passed to KokkosSparse::CrsMatrix*Created by: mndevec*
I am providing a device to CrsMatrix in the constructor. This is different from the default, where the device is either Kokkos::Cuda with host-pinned space, or Kokkos::OpenMP with Kokkos::HostSpace when HBM is enabled.
I was expecting CrsMatrix memory to be allocated in the memory space I provide. However, this holds only for the values view, while row pointers and entries are still allocated in the default memory space of the provided execution space.
It seems that StaticCrsGraph does not take the device as a template argument; instead, it is given the execution space. It then creates a default device, so the memory spaces of the values and entries views diverge.
Shouldn't StaticCrsGraph take the device as template argument instead of execution space?
@srajama1 @crtrott @mhoemmen
Tpetra-backlog
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/981Remove Tribits dependencies in KokkosKernels2017-01-06T03:27:05ZJames WillenbringRemove Tribits dependencies in KokkosKernels*Created by: srajama1*
There are a bunch of TriBITS-related dependencies that have come into KokkosKernels. Eventually we would like a Kokkos model where we could build without TriBITS and test with gtest. This needs a cleanup.
@mndevec @crtrott @mhoemmen
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1176Analysis of configure and build failures due to KokkosKernels pushes on March...2017-03-23T17:54:16ZJames WillenbringAnalysis of configure and build failures due to KokkosKernels pushes on March 1-2 reported in #1099*Created by: bartlettroscoe*
**Description:**
This story is to analyze the KokkosKernels commits pushed on March 1-2, 2017 that broke the configure and build of Trilinos that was reported in #1099 and see if usage of the [checkin-test-sems.sh](https://github.com/trilinos/Trilinos/wiki/Policies-%7C-Safe-Checkin-Testing) script could have avoided the failures (and resulting consequences) if it had been used for the test and push (which would have stopped the push).
This started when a set of commits were pushed to the Trilinos 'develop' branch on March 1 with the top commit 97ed757 being:
```
97ed757 [Wed Mar 1 08:24:01 2017 -0700] <crtrott@sandia.gov>
Kokkos-Kernels: Adding Kokkos-Kernels as a stand-alone package
```
as shown by the CI build:
* http://testing.sandia.gov/cdash/viewConfigure.php?buildid=2766164
That version of Trilinos failed to configure as shown on that CI build iteration.
Later that day, issue #1099 was created by an important Trilinos customer and it resulted in 35 comments that involved 9 people in that issue before it was resolved.
An attempt to fix this problem was pushed later that day with the top commit de7ac5a being:
```
de7ac5a [Wed Mar 1 13:17:16 2017 -0700] <mhoemme@sandia.gov>
KokkosKernels: Fix #1099
```
as shown by the CI build:
* http://testing.sandia.gov/cdash/viewConfigure.php?buildid=2766462
That version passed the configure but resulted in many build failures.
The build was not finally fixed until March 2 as shown at:
* http://testing.sandia.gov/cdash/viewConfigure.php?buildid=2767860
Could using the checkin-test-sems.sh script have caught these problems and stopped the pushes that broke the configure and build of Trilinos over these two days?
**CC:** @trilinos/framework, @bathmatt, @crtrott, @mhoemmen
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1401Experimental RBILUK 2017-09-21T16:05:25ZJames WillenbringExperimental RBILUK *Created by: mndevec*
This issue will track the development of new RBILUK.
Goals:
1- Clear separation of the symbolic, numeric and solve phases.
2- Replacing old BLAS calls, such as GEMM and GETF2, with the register-blocking little-block implementations in KokkosKernels.
The issue will be developed in the fork:
https://github.com/mndevec/Trilinos/tree/develop
@trilinos/ifpack2
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1622KokkosKernels: Gauss-Seidel threaded setup performance issues with Ifpack2 an...2017-09-12T19:15:52ZJames WillenbringKokkosKernels: Gauss-Seidel threaded setup performance issues with Ifpack2 and MueLu*Created by: pwxy*
I am trying to use the KokkosKernels threaded Gauss-Seidel.
I'm calling it through ifpack2 ("MT Gauss-Seidel") as a smoother for MueLu,
so the problem could be a bad interaction between KokkosKernels and ifpack2 or MueLu.
I'm running drekar on a single KNL of mutrino, with 1 MPI process, and I increase the OMP threads from 1 to 64 (1 OMP thread per core):
setup smoother (ifpack2 "MT Gauss-Seidel")
| OMP threads | solve time (s) | GS setup time (s) |
| -:| -:|-------:|
|1|33.27|493.10|
|2|24.67|286.50|
|4|12.26|157.80|
|8|6.97|79.82|
|16|3.97|36.61|
|32|3.50|24.06|
|64|3.16|16.01|
For reference, here are the times if I use the standard, non-threaded Gauss-Seidel
(but if it really is non-threaded, why is the setup time going down as the number of OMP threads is increased?)
setup smoother (ifpack2 "Gauss-Seidel")
| OMP threads | solve time (s) | GS setup time (s) |
| -:| -:|-------:|
|1|27.04|0.36|
|2|25.13|0.21|
|4|24.09|0.13|
|8|23.58|0.09|
|16|23.38|0.06|
|32|23.32|0.05|
|64|23.33|0.05|
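To make the scaling in the two tables easier to compare, the speedup and parallel efficiency relative to the 1-thread run can be computed from the reported timings. This is only arithmetic on the numbers above; the dictionary and function names are made up for illustration.

```python
# Timings copied from the two tables above, in seconds.
# Key: OMP thread count. Value: (solve time, GS setup time).
MT_GS = {1: (33.27, 493.10), 2: (24.67, 286.50), 4: (12.26, 157.80),
         8: (6.97, 79.82), 16: (3.97, 36.61), 32: (3.50, 24.06),
         64: (3.16, 16.01)}
STD_GS = {1: (27.04, 0.36), 2: (25.13, 0.21), 4: (24.09, 0.13),
          8: (23.58, 0.09), 16: (23.38, 0.06), 32: (23.32, 0.05),
          64: (23.33, 0.05)}
SOLVE, SETUP = 0, 1  # column indices into the tuples above


def speedup(times, threads, column):
    """Time on 1 thread divided by time on `threads` threads."""
    return times[1][column] / times[threads][column]


def efficiency(times, threads, column):
    """Speedup divided by thread count (1.0 would be perfect scaling)."""
    return speedup(times, threads, column) / threads


if __name__ == "__main__":
    for t in sorted(MT_GS):
        print(t,
              round(speedup(MT_GS, t, SOLVE), 2),
              round(speedup(MT_GS, t, SETUP), 2),
              round(efficiency(MT_GS, t, SETUP), 2))
```

From 1 to 64 threads, the "MT Gauss-Seidel" setup time improves by roughly 30.8x and its solve time by about 10.5x, yet at 64 threads its setup (16.01 s) is still far larger than the standard Gauss-Seidel setup (0.05 s).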
drekar/Trilinos was built with intel 17.0.2 and gnu 6.1.0 (Trilinos repo as of August 16, 2017)
I ran VTune on ellis for the 1 OMP thread case (the 493.1 s case above).
According to VTune, all the time is spent in the two Kokkos::parallel_for calls in
KokkosKernels::Experimental::Util::symmetrize_graph_symbolic_hashmap (lines 1097 and 1139 of KokkosKernels_Utils.hpp).
The time is split roughly equally between the two Kokkos::parallel_for calls.
The following is the stack trace from Ifpack2:
```
Ifpack2::Relaxation::initialize()
KokkosKernels::Experimental::Graph::gauss_seidel_symbolic
KokkosKernels::Experimental::Graph::Impl::GaussSeidel
KokkosKernels::Experimental::Util::symmetrize_graph_symbolic_hashmap (Kokkos::parallel_for on line 1097 and 1139)
Kokkos::parallel_for
Kokkos::parallel_for
```
Edit (@aprokop): formatting
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1729kokkos-kernels: Turn off unneeded instantiations by default2017-10-26T21:10:54ZJames Willenbringkokkos-kernels: Turn off unneeded instantiations by default*Created by: mhoemmen*
@trilinos/tpetra
Trilinos doesn't need the following instantiations:
- `CudaSpace` (not used, Trilinos assumes UVM and uses `CudaUVMSpace`)
- Unused offset types
See `${PACKAGE_NAME}_INST_MEMSPACE_CUDASPACE` in `kokkos-kernels/CMakeLists.txt`. I'm not sure if Tpetra can control these directly; it may be necessary to change defaults in kokkos-kernels.
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2221Pamgen,kokkos-kernels: NVCC build error with -D CMAKE_CXX_USE_RESPONSE_FILE_F...2019-03-05T22:14:10ZJames WillenbringPamgen,kokkos-kernels: NVCC build error with -D CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON*Created by: mhoemmen*
kokkos-kernels requires `-D CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON` for CUDA builds, even with complex arithmetic disabled.
https://github.com/trilinos/Trilinos/issues/2115#issuecomment-357750048
However, Pamgen does not build, given those settings.
@trilinos/kokkos-kernels @trilinos/pamgen @trilinos/framework
CC: @rrdrake @prwolfe
## Expectations
1. Whatever flags kokkos-kernels needs to build on CUDA must not block building other essential packages.
2. kokkos-kernels needs to document its CMake requirements.
3. kokkos-kernels needs to fail at configure time, with an informative message, if the CMake variables that it needs are not set.
## Current Behavior
```
$ make
[ 0%] Linking CXX shared library libpamgen.so
nvcc fatal : No input files specified; use option --help for more information
make[2]: *** [packages/pamgen/src/libpamgen.so.12.13] Error 1
make[1]: *** [packages/pamgen/src/CMakeFiles/pamgen.dir/all] Error 2
make: *** [all] Error 2
```
## Motivation and Context
Many downstream tests, including Belos and Ifpack2, depend on Pamgen. This blocks adequate Trilinos testing on CUDA.
## Steps to Reproduce
```
$ module list
Currently Loaded Modulefiles:
1) sems-env 3) sems-cmake/3.3.2 5) kokkos-cuda/8.0.44
2) kokkos-env 4) sems-gcc/5.3.0 6) kokkos-openmpi/2.0.1/cuda
```
CMake configuration:
```
-D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON
-D BUILD_SHARED_LIBS:BOOL=ON
-D Trilinos_ENABLE_OpenMP:BOOL=ON
-D Kokkos_ENABLE_OpenMP:BOOL=ON
-D Tpetra_INST_OPENMP:BOOL=ON
-D Trilinos_SHOW_DEPRECATED_WARNINGS:BOOL=ON
-D Trilinos_ENABLE_Fortran:BOOL=OFF
-D TPL_ENABLE_CUDA:BOOL=ON
-D KOKKOS_ARCH="SNB;Kepler35"
-D Kokkos_ENABLE_Cuda:BOOL=ON
-D Kokkos_ENABLE_Cuda_UVM:BOOL=ON
-D Tpetra_INST_CUDA:BOOL=ON
-D Kokkos_ENABLE_Cuda_Lambda:BOOL=ON
-D CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS:BOOL=ON
-D CMAKE_CXX_FLAGS:STRING="-Wall"
-D Trilinos_CXX11_FLAGS:STRING="-std=c++11 --expt-extended-lambda"
-D TPL_ENABLE_MKL:BOOL=OFF
-D TPL_ENABLE_Matio:BOOL=OFF
-D TPL_ENABLE_SuperLU:BOOL=OFF
-D TPL_ENABLE_Zlib:BOOL=OFF
-D TPL_ENABLE_Netcdf:BOOL=OFF
-D TPL_ENABLE_HDF5:BOOL=OFF
-D TPL_ENABLE_ParMETIS:BOOL=OFF
-D TPL_ENABLE_Boost:BOOL=OFF
-D TPL_ENABLE_BoostLib:BOOL=OFF
-D TPL_ENABLE_yaml-cpp:BOOL=OFF
-D TPL_ENABLE_MPI:BOOL=ON
```
## Your Environment
- develop branch, commit 400765e21e17dfb995e0f4a2759ce9c5f961b685 (likely not related)
## Related Issues
* Blocks #2115
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2772Tpetra: Micro-Benchmark MultiVector Update2018-05-23T21:06:28ZJames WillenbringTpetra: Micro-Benchmark MultiVector Update*Created by: csiefer2*
1) Write a Tpetra vs. KokkosBlas vs. Raw Lambda micro-benchmark to compare MultiVector Update performance in serial. Check into Tpetra @csiefer2
2) Study to understand performance @crtrott
Other potentially interested parties @jhux2 @srajama1
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2877Tpetra: Norm2 (dot) is substantially slower than direct calls to TPL BLAS2018-06-04T21:26:32ZJames WillenbringTpetra: Norm2 (dot) is substantially slower than direct calls to TPL BLAS*Created by: jjellio*
When strong scaling an app, we noticed the norm2 (among other things) is 'blowing' up. In this case, the customer has a challenging problem that they want to strong scale, while using 2 processes per node and 16 threads per process (on Haswell nodes). The problem is that this results in relatively small work per process for the dense linear algebra, and KokkosKernels::BLAS does not impose a reasonable minimum chunk size for its operations.
Considering that the vendor BLAS is Intel's MKL, it isn't surprising that it performs very well. This leaves a reasonable question: Why doesn't Tpetra prefer vendor BLAS?
To support my argument, I modified Belos' MultiVectorTrait::norm2 to call TPL BLAS ddot rather than Tpetra::norm2 (which if you follow the rabbit hole, eventually calls KokkosKernels::BLAS:dot).
I wrote a simple test code that calls MVT::Norm2 a few thousand times in a loop. I profiled this code linked against my modified MVT and the vanilla Trilinos one. (I.e., TPL ddot vs Kokkos dot). For the experiment, I fixed the data per MPI process to 1000 elements (i.e., a very small work size). I then weak scale this perfectly, incrementally filling nodes with 2 processes.
I also profiled the cost of Teuchos::reduceAll, with a single scalar. I ran this with OMP_NUM_THREADS=1 and 16.
## Regular MVT::norm2 and All Reduce (1 thread)
![baseline1-1](https://user-images.githubusercontent.com/21248657/40936039-351eb936-67f7-11e8-9d0d-18496b9b4b3d.png)
## Regular MVT::norm2 and All Reduce (16 threads)
![baseline16-1](https://user-images.githubusercontent.com/21248657/40936049-3cffa28c-67f7-11e8-8103-dd71b81a0545.png)
## All Reduce unaffected by threads
![baselinethreadcomp2-1](https://user-images.githubusercontent.com/21248657/40936072-54554c66-67f7-11e8-98fc-873a66d82fd0.png)
## Using TPL dot with 1 or 16 threads
![modbelos-1](https://user-images.githubusercontent.com/21248657/40936110-7755e310-67f7-11e8-8596-5e21dc86dcd6.png)
## TPL ddot vs Kokkos
![modbelos-2](https://user-images.githubusercontent.com/21248657/40936122-858f7c0c-67f7-11e8-948a-3585142f5505.png)
![modbelos-3](https://user-images.githubusercontent.com/21248657/40936126-899cff22-67f7-11e8-89ed-dd5f65d9d921.png)
While I profiled norm2, the issue is really the underlying call to ddot. In this case, MKL is doing a much better job of throttling back its thread count. Still, calling the threaded BLAS is an overall loss, and in this case calling a purely serial MKL would have been better.
One option to mitigate the lack of thread scaling we see is to ensure that BLAS operations are called with a meaningful chunk size. A simpler solution, which would reduce the Trilinos code base, would be to call TPL BLAS for the Serial, Threads, and OpenMP execution spaces.
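The "meaningful chunk size" option can be sketched as a simple heuristic: use only as many threads as can each receive a minimum amount of work. This is a hypothetical illustration, not how KokkosKernels or MKL actually choose thread counts; the function name and the `min_chunk` value of 4096 are placeholders that would need tuning.

```python
def threads_for(n, max_threads, min_chunk):
    """Pick a thread count so each thread gets at least min_chunk elements.

    Hypothetical heuristic: small vectors fall back to a single thread,
    and large vectors are capped at max_threads.
    """
    if n <= 0 or min_chunk <= 0:
        return 1
    return max(1, min(max_threads, n // min_chunk))


if __name__ == "__main__":
    # 1000 elements per process, as in the experiment described above.
    for n in (1000, 10_000, 100_000, 1_000_000):
        print(n, threads_for(n, max_threads=16, min_chunk=4096))
```

Under this placeholder threshold, the 1000-element vectors from the experiment would run on a single thread, which matches the observation that a serial dot would have been the better choice at that size.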
This contradicts the recommendation in #2850
@trilinos/tpetra
@trilinos/kokkos-kernels
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2894MueLu: types mismatch in Driver.cpp equilibration2018-06-06T17:13:43ZJames WillenbringMueLu: types mismatch in Driver.cpp equilibration*Created by: lucbv*
@trilinos/muelu
## Expectations
The vector and matrix Scalar types forming a linear system should be consistent.
## Current Behavior
It seems that a call to KokkosBlas::abs is done on a `Scalar` type vector and a `magnitude` type vector.
## Motivation and Context
The code is not compiling properly when `std::complex<>` is used
## Definition of Done
- [ ] MueLu Driver compiles
## Possible Solution
My guess is that in MueLu_Driver.cpp on line 167 where the call to KokkosBlas::abs() is made, the two vectors should use the same `Scalar` type. Most likely the magnitude type needs to be replaced by a Scalar type even if this means that in the case of complex numbers only the real part is non zero.
## Steps to Reproduce
Build MueLu with tests and examples on and with `Trilinos_ENABLE_Complex=ON`.
## Your Environment
See builds on CDash.
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3456PyTrilinos: Conflicting types in Teuchos_BLAS_wrapper.hpp2018-10-25T16:48:26ZJames WillenbringPyTrilinos: Conflicting types in Teuchos_BLAS_wrapper.hpp*Created by: wfspotz*
@trilinos/pytrilinos
@trilinos/teuchos
@trilinos/kokkos-kernels
## Expectations
I expect to build PyTrilinos without compilation errors
## Current Behavior
I have a wrapper file that `#include`s the header `MLAPI_MultiVector.h` which gives the compilation error
/Development/Trilinos/packages/teuchos/numerics/src/Teuchos_BLAS_wrappers.hpp:173:13: error: conflicting types for 'daxpy_'
void PREFIX DAXPY_F77(const int* n, const double* alpha, const double x[], const int* incx, double y[], const int* incy);
^
/Development/Trilinos/packages/teuchos/numerics/src/Teuchos_BLAS_wrappers.hpp:78:21: note: expanded from macro 'DAXPY_F77'
#define DAXPY_F77 F77_BLAS_MANGLE(daxpy,DAXPY)
^
/Development/Trilinos/MPI/packages/teuchos/core/src/Teuchos_config.h:10:37: note: expanded from macro 'F77_BLAS_MANGLE'
#define F77_BLAS_MANGLE(name,NAME) name ## _
^
<scratch space>:23:1: note: expanded from here
daxpy_
^
/Development/Trilinos/packages/kokkos-kernels/src/impl/tpls/KokkosBlas1_axpby_tpl_spec_decl.hpp:49:17: note: previous declaration is here
extern "C" void daxpy_( const int* N, const double* alpha,
## Definition of Done
I can get the PyTrilinos package to build and all of the PyTrilinos tests to pass.
## Possible Solution
I'm not sure why this conflicting type declaration is occurring, but it is clearly related to macro expansion. Based on `git blame` of `Teuchos_BLAS_wrappers.hpp`, I'm hoping either @jwillenbring or @hkthorn might have some idea what the problem could be. PyTrilinos can present some unique configuration issues.
## Steps to Reproduce
If it gets to this, I can help someone set up their environment to build PyTrilinos.
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3574Ifpack2 + Kokkos + complex float : compilation error2019-03-08T14:28:23ZJames WillenbringIfpack2 + Kokkos + complex float : compilation error*Created by: davydden*
## Expectations
Trilinos `trilinos-release-12-14-branch` builds with ETI and complex and float types
## Current Behavior
```
8051 In file included from /Users/davydden/spack/var/spack/stage/trilinos-12.14-hpmbucm6pphwdcf3p6hhqcac2cm5qts6/Trilinos/spack-build/packages/ifpack2/src/Ifpack2_BlockTriDiContainer_Serial.cpp:50:
8052 In file included from /Users/davydden/spack/var/spack/stage/trilinos-12.14-hpmbucm6pphwdcf3p6hhqcac2cm5qts6/Trilinos/packages/ifpack2/src/Ifpack2_BlockTriDiContainer_def.hpp:52:
>> 8053 /Users/davydden/spack/var/spack/stage/trilinos-12.14-hpmbucm6pphwdcf3p6hhqcac2cm5qts6/Trilinos/packages/kokkos-kernels/src/batched/KokkosBatched_Util.hpp:195:7: error: static_assert failed "KokkosKernels:: Invalid SIMD<> type."
8054 static_assert( std::is_same<T,bool>::value ||
8055 ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
8056 /Users/davydden/spack/var/spack/stage/trilinos-12.14-hpmbucm6pphwdcf3p6hhqcac2cm5qts6/Trilinos/packages/ifpack2/src/Ifpack2_BlockTriDiContainer_impl.hpp:1568:28: note: in instantiation of template class 'KokkosBatched::Experimental::SIMD<Kokkos::complex<float> >' requested here
8057 B.assign_data( &vector_values(i0+1,0,0) );
8058 ^
8059 /Users/davydden/spack/var/spack/stage/trilinos-12.14-hpmbucm6pphwdcf3p6hhqcac2cm5qts6/Trilinos/packages/ifpack2/src/Ifpack2_BlockTriDiContainer_impl.hpp:1651:9: note: in instantiation of member function 'Ifpack2::BlockTriDiContainerDetails::ExtractAndFactorizeTridiags<Tpetra::Classes::RowMatrix<std::__1::complex<float>, int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > >::factorize' requested here
```
## Steps to Reproduce
configure and build with:
```
-DTrilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON
-DTpetra_INST_DOUBLE:BOOL=ON
-DTpetra_INST_INT_LONG:BOOL=ON
-DTpetra_INST_COMPLEX_DOUBLE=ON
-DTpetra_INST_COMPLEX_FLOAT=ON
-DTpetra_INST_FLOAT=ON
-DTpetra_INST_SERIAL=ON
-DTeuchos_ENABLE_COMPLEX=ON
-DTeuchos_ENABLE_FLOAT=ON
```
## Your Environment
macOS Mojave
Apple Clang 10.0.0
gfortran 8.2.0
## Additional Information
full config/build logs:
[logs.zip](https://github.com/trilinos/Trilinos/files/2454254/logs.zip)
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3664KokkosKernels: missing doxygen documentation2018-10-18T21:24:43ZJames WillenbringKokkosKernels: missing doxygen documentation*Created by: jhux2*
KokkosKernels doesn't have any links to its Doxygen documentation on trilinos.org, at least not that I could find. I think it would be useful to have doxygen available, similarly to what other Trilinos packages provide, e.g., [Tpetra](https://trilinos.org/docs/dev/packages/tpetra/doc/html/index.html).
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/4571Long file names in KokkosKernels have broken git clone of Trilinos on Windows2019-03-07T19:52:38ZJames WillenbringLong file names in KokkosKernels have broken git clone of Trilinos on Windows*Created by: bartlettroscoe*
CC: @trilinos/framework, @trilinos/kokkos-kernels
## Description
It appears that the Trilinos git repo has gotten into a state where it can't even be cloned on Windows systems with git.
The CASL PHI INF lead @lefebvrera reported the following error when trying to clone the current Trilinos git repo just a short time ago:
```
G:\raq\scale_dev\dev>git clone https://github.com/trilinos/Trilinos.git Trilinos
Cloning into 'Trilinos'...
remote: Enumerating objects: 47, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 957120 (delta 17), reused 14 (delta 7), pack-reused 957073
Receiving objects: 100% (957120/957120), 562.05 MiB | 26.56 MiB/s, done.
Resolving deltas: 100% (771153/771153), done.
error: unable to create file packages/kokkos-kernels/src/impl/generated_specializations_cpp/gauss_seidel_numeric/KokkosSparse_gauss_seidel_numeric_eti_spec_inst_Kokkos_complex_double__size_t_int64_t_LayoutRight_Cuda_CudaHostPinnedSpace_CudaHostPinnedSpace.cpp: Filename too long
error: unable to create file packages/kokkos-kernels/src/impl/generated_specializations_cpp/gauss_seidel_symbolic/KokkosSparse_gauss_seidel_symbolic_eti_spec_inst_Kokkos_complex_double__size_t_int64_t_LayoutLeft_Cuda_CudaHostPinnedSpace_CudaHostPinnedSpace.cpp: Filename too long
error: unable to create file packages/kokkos-kernels/src/impl/generated_specializations_cpp/gauss_seidel_symbolic/KokkosSparse_gauss_seidel_symbolic_eti_spec_inst_Kokkos_complex_double__size_t_int64_t_LayoutRight_Cuda_CudaHostPinnedSpace_CudaHostPinnedSpace.cpp: Filename too long
error: unable to create file packages/kokkos-kernels/src/impl/generated_specializations_cpp/gauss_seidel_symbolic/KokkosSparse_gauss_seidel_symbolic_eti_spec_inst_Kokkos_complex_float__size_t_int64_t_LayoutLeft_Cuda_CudaHostPinnedSpace_CudaHostPinnedSpace.cpp: Filename too long
error: unable to create file packages/kokkos-kernels/src/impl/generated_specializations_cpp/gauss_seidel_symbolic/KokkosSparse_gauss_seidel_symbolic_eti_spec_inst_Kokkos_complex_float__size_t_int64_t_LayoutRight_Cuda_CudaHostPinnedSpace_CudaHostPinnedSpace.cpp: Filename too long
Checking out files: 100% (56578/56578), done.
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'
```
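A check like the following could flag such files before they are committed. It is a sketch under stated assumptions: it uses the classic Windows `MAX_PATH` limit of 260 characters (including the terminating NUL), the function name is hypothetical, and the list of repo-relative paths is assumed to come from something like `git ls-files`.

```python
WINDOWS_MAX_PATH = 260  # classic Win32 limit, counting the terminating NUL


def paths_too_long(repo_paths, clone_dir, limit=WINDOWS_MAX_PATH):
    """Return the repo-relative paths whose full Windows path exceeds `limit`.

    clone_dir is the directory the repo is cloned into, for example
    r"G:\\raq\\scale_dev\\dev\\Trilinos"; the deeper it is, the less
    room remains for paths inside the repo.
    """
    prefix = clone_dir.rstrip("\\/")
    too_long = []
    for rel in repo_paths:
        full = prefix + "\\" + rel.replace("/", "\\")
        if len(full) + 1 > limit:  # +1 for the terminating NUL
            too_long.append(rel)
    return too_long


if __name__ == "__main__":
    import sys
    # Usage sketch: git ls-files | python check_paths.py "C:\\src\\Trilinos"
    repo_paths = [line.rstrip("\n") for line in sys.stdin]
    for path in paths_too_long(repo_paths, sys.argv[1]):
        print(path)
```

Because the result depends on where the repository is cloned, the same file can check out on one machine and fail on another; a stricter project-wide limit on repo-relative path length would leave headroom for the clone directory.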
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/5014KokkosKernels: KokkosKernels_sparse_ tests timing out on ATDM complex build2019-04-25T18:53:43ZJames WillenbringKokkosKernels: KokkosKernels_sparse_ tests timing out on ATDM complex build*Created by: fryeguy52*
## Bug Report
CC: @trilinos/kokkoskernels, @kddevin (Trilinos Data Services Product Lead), @bartlettroscoe, @fryeguy52
## Next Action Status
<status-and-or-first-action>
## Description
As shown in [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=4&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-sems-rhel7-intel-17.0.1-openmp-complex-shared-debug&field2=testname&compare2=65&value2=KokkosKernels_sparse_&field3=buildstarttime&compare3=84&value3=2019-04-25T00%3A00%3A00&field4=buildstarttime&compare4=83&value4=2019-03-26T00%3A00%3A00) the tests:
* KokkosKernels_sparse_openmp_MPI_1
* KokkosKernels_sparse_serial_MPI_1
are timing out in the build:
* Trilinos-atdm-sems-rhel7-intel-17.0.1-openmp-complex-shared-debug
## Current Status on CDash
[The status of these tests for the current testing day](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=4&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-sems-rhel7-intel-17.0.1-openmp-complex-shared-debug&field2=testname&compare2=65&value2=KokkosKernels_sparse_&field3=buildstarttime&compare3=84&value3=today&field4=buildstarttime&compare4=83&value4=2%20days%20ago)
## Steps to Reproduce
One should be able to reproduce this failure with a sems rhel6 environment as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, the commands for a sems rhel6 environment are provided at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#sems-rhel6-environment
The exact commands to reproduce this issue should be:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-sems-rhel7-intel-17.0.1-openmp-complex-shared-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_KokkosKernels=ON \
$TRILINOS_DIR
$ make NP=16
$ ctest -j8
```
Initial cleanup of new ATDM builds of Trilinos