Test TpetraCore_MatrixMatrix_UnitTests_MPI_4 failing in all ATDM Trilinos CUDA builds starting 6/7/2018
Created by: bartlettroscoe
CC: @trilinos/tpetra, @fryeguy52, @kddevin (Data Services Product Lead)
Next Action Status
Merged PR #2122 fixed this in all ATDM builds on 6/11/2018.
Description
As shown in this query for the test TpetraCore_MatrixMatrix_UnitTests_MPI_4 between 6/5/2018 and 6/8/2018, this test started failing on 6/7/2018 in all of the CUDA builds:
- Trilinos-atdm-hansen-shiller-cuda-8.0-debug
- Trilinos-atdm-hansen-shiller-cuda-8.0-opt
- Trilinos-atdm-hansen-shiller-cuda-9.0-debug
- Trilinos-atdm-hansen-shiller-cuda-9.0-opt
- Trilinos-atdm-white-ride-cuda-debug
- Trilinos-atdm-white-ride-cuda-opt
For example, the output of the failing test for the build Trilinos-atdm-hansen-shiller-cuda-9.0-opt on 'hansen' on 2018-06-07T18:36:47 UTC, shown at:
showed:
p=0: *** Caught standard std::exception of type 'std::runtime_error' :
Invalid SPGEMMAlgorithm name
[FAILED] (6.84 sec) Tpetra_MatMat_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_operations_test_UnitTest
Location: /home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-9.0-opt/SRC_AND_BUILD/Trilinos/packages/tpetra/core/test/MatrixMatrix/MatrixMatrix_UnitTests.cpp:789
...
p=0: *** Caught standard std::exception of type 'std::runtime_error' :
Invalid SPGEMMAlgorithm name
Tpetra sparse matrix-matrix multiply: range row test
getIdentityMatrix
Create row Map
Create CrsMatrix
[FAILED] (0.865 sec) Tpetra_MatMat_double_int_longlong_Kokkos_Compat_KokkosCudaWrapperNode_operations_test_UnitTest
Location: /home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-9.0-opt/SRC_AND_BUILD/Trilinos/packages/tpetra/core/test/MatrixMatrix/MatrixMatrix_UnitTests.cpp:789
...
The following tests FAILED:
0. Tpetra_MatMat_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_operations_test_UnitTest ...
10. Tpetra_MatMat_double_int_longlong_Kokkos_Compat_KokkosCudaWrapperNode_operations_test_UnitTest ...
Total Time: 11.1 sec
Summary: total = 20, run = 20, passed = 18, failed = 2
End Result: TEST FAILED
All of the other failed test runs showed nearly identical test output.
When the test passed for the build Trilinos-atdm-hansen-shiller-cuda-9.0-opt
on 'hansen' on 2018-06-05T20:22:00 UTC shown at:
it showed the test output:
Total Time: 29.4 sec
Summary: total = 20, run = 20, passed = 20, failed = 0
End Result: TEST PASSED
Steps to reproduce
Following the instructions at:
one should be able to reproduce this failure on 'hansen', 'shiller', 'white', or 'ride'. Because 'white' is on the SON and is lightly loaded, it is a convenient place to reproduce this, as described at:
with:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Tpetra=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -VV -R TpetraCore_MatrixMatrix_UnitTests_MPI_4
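To confirm that a local run hits the same failure and not an unrelated one, the verbose ctest output can be saved to a file and searched for the exception text quoted in the Description. The following is only a minimal sketch: the helper name `count_spgemm_failures` and the log file name `ctest.out` are hypothetical and are not part of Trilinos or the ATDM scripts.

```shell
# Hypothetical helper (not part of Trilinos): count the lines in a saved
# ctest log that contain the exception text reported in this issue.
count_spgemm_failures() {
  # grep -c prints the number of matching lines. With no matches it prints 0
  # and returns a non-zero status, so the status is masked with "|| true".
  grep -c 'Invalid SPGEMMAlgorithm name' "$1" || true
}

# Example usage, assuming the ctest output was captured to ctest.out, e.g. by
# appending "2>&1 | tee ctest.out" to the ctest command:
#   count_spgemm_failures ctest.out
```

A non-zero count suggests the run reproduced the failure described here; a count of 0 alongside a failing test points to a different problem.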