TpetraCore_BlockCrsMatrix_MPI_4 failing in ATDM cuda builds
Created by: fryeguy52
CC: @trilinos/tpetra, @kddevin (Trilinos Data Services Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
With the merge of PR #4307 onto 'develop' on 2/4/2019, the test TpetraCore_BlockCrsMatrix_MPI_4
passed in all of the ATDM Trilinos builds on 2/5/2019. Next: Get PR #4326 merged, which re-enables this test in the Trilinos CUDA PR build ...
Description
As shown in this query, the test:
- TpetraCore_BlockCrsMatrix_MPI_4
is failing in the builds:
- Trilinos-atdm-waterman-cuda-9.2-debug
- Trilinos-atdm-waterman-cuda-9.2-opt
- Trilinos-atdm-waterman-cuda-9.2-release-debug
- Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug
- Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release
- Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug
It is failing with the following output:
p=0: *** Caught standard std::exception of type 'std::logic_error' :
/home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug/SRC_AND_BUILD/Trilinos/packages/tpetra/core/src/Tpetra_Experimental_BlockCrsMatrix_def.hpp:2825:
Throw number = 1
Throw test that evaluated to true: numBytesOut != numBytes
unpackRow: numBytesOut = 4 != numBytes = 156.
[FAILED] (0.0877 sec) BlockCrsMatrix_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_write_UnitTest
Location: /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug/SRC_AND_BUILD/Trilinos/packages/tpetra/core/test/Block/BlockCrsMatrix.cpp:859
[white23:102556] *** An error occurred in MPI_Allreduce
[white23:102556] *** reported by process [231079937,0]
[white23:102556] *** on communicator MPI_COMM_WORLD
[white23:102556] *** MPI_ERR_OTHER: known error not in list
[white23:102556] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[white23:102556] *** and potentially your MPI job)
@kyungjoo-kim, can you check whether one of these commits caused this?
47f9cbe: Tpetra - fix failing test
Author: Kyungjoo Kim (-EXP) <kyukim@bread.sandia.gov>
Date: Tue Jan 22 11:24:43 2019 -0700
M packages/tpetra/core/src/Tpetra_Experimental_BlockCrsMatrix_def.hpp
3e26a55: Tpetra - fix warning error from mismatched virtual functions
Author: Kyungjoo Kim (-EXP) <kyukim@bread.sandia.gov>
Date: Mon Jan 21 11:48:32 2019 -0700
M packages/tpetra/core/src/Tpetra_Experimental_BlockCrsMatrix_decl.hpp
M packages/tpetra/core/src/Tpetra_Experimental_BlockCrsMatrix_def.hpp
Current Status on CDash
The current status of these tests/builds for the current testing day can be found here
Steps to Reproduce
One should be able to reproduce this failure on ride or white as described in:
More specifically, the commands given for ride or white are provided at:
The exact commands to reproduce this issue should be:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release
$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Tpetra=ON \
  $TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
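The last command above runs the full enabled test suite. To iterate on just this failure, it may be quicker to restrict ctest to the one test. The sketch below reuses the bsub options from the command above unchanged and adds only the standard ctest options `-R` (regex filter on test names) and `--output-on-failure`. The variable name `test_regex` and the `command -v` guard are illustrative additions, not part of the ATDM instructions.

```shell
# Anchored regex so that only the failing test is selected.
test_regex='^TpetraCore_BlockCrsMatrix_MPI_4$'

# Assumption: same queue and allocation as the full-suite command above.
if command -v bsub >/dev/null 2>&1; then
  bsub -x -Is -q rhel7F -n 16 ctest -R "$test_regex" --output-on-failure
else
  # bsub is only expected to exist on machines such as ride or white.
  echo "bsub not found; run this from the build directory on ride or white" >&2
fi
```

With `--output-on-failure`, ctest prints the test's output only when it fails, which should make the `unpackRow` exception message visible without digging through the log files.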